Skip to content

src: use simdutf for two-byte string utf8 conversion in utf8 value#62248

Open
mertcanaltin wants to merge 1 commit intonodejs:mainfrom
mertcanaltin:mert/use-simdutf-for-two-byte-utf8
Open

src: use simdutf for two-byte string utf8 conversion in utf8 value#62248
mertcanaltin wants to merge 1 commit intonodejs:mainfrom
mertcanaltin:mert/use-simdutf-for-two-byte-utf8

Conversation

@mertcanaltin
Copy link
Member

@mertcanaltin mertcanaltin commented Mar 14, 2026

latin-1 already used simdutf #61696,

I changed WriteUtf8V2 to simdutf for two-byte strings.

➜  node git:(mert/use-simdutf-for-two-byte-utf8) βœ— node-benchmark-compare ./result.csv
                                                confidence improvement accuracy (*)   (**)  (***)
util/utf8-value.js n=5000000 type='ascii'                      -0.11 %       Β±2.64% Β±3.69% Β±5.18%
util/utf8-value.js n=5000000 type='mixed'              ***     49.60 %       Β±1.90% Β±2.67% Β±3.78%
util/utf8-value.js n=5000000 type='three_bytes'        ***     31.10 %       Β±2.42% Β±3.34% Β±4.60%
util/utf8-value.js n=5000000 type='two_bytes'          ***     42.75 %       Β±1.42% Β±1.94% Β±2.66%

Be aware that when doing many comparisons the risk of a false-positive result increases.
In this case, there are 4 comparisons, you can thus expect the following amount of false-positive results:
  0.20 false positives, when considering a   5% risk acceptance (*, **, ***),
  0.04 false positives, when considering a   1% risk acceptance (**, ***),
  0.00 false positives, when considering a 0.1% risk acceptance (***)

@nodejs-github-bot
Copy link
Collaborator

Review requested:

  • @nodejs/performance

@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. needs-ci PRs that need a full CI run. labels Mar 14, 2026
@mertcanaltin mertcanaltin force-pushed the mert/use-simdutf-for-two-byte-utf8 branch from f6d6f82 to d459594 Compare March 14, 2026 08:31
@codecov
Copy link

codecov bot commented Mar 14, 2026

Codecov Report

❌ Patch coverage is 90.90909% with 1 line in your changes missing coverage. Please review.
βœ… Project coverage is 89.68%. Comparing base (da5843b) to head (fb16b2d).
⚠️ Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
src/util.cc 90.90% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main   #62248   +/-   ##
=======================================
  Coverage   89.67%   89.68%           
=======================================
  Files         676      676           
  Lines      206469   206507   +38     
  Branches    39537    39538    +1     
=======================================
+ Hits       185157   185199   +42     
+ Misses      13448    13447    -1     
+ Partials     7864     7861    -3     
Files with missing lines Coverage Ξ”
src/util.cc 87.41% <90.90%> (-0.03%) ⬇️

... and 35 files with indirect coverage changes

πŸš€ New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • πŸ“¦ JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

src/util.cc Outdated
size_t length = string->WriteUtf8V2(
isolate, target->out(), storage, String::WriteFlags::kReplaceInvalidUtf8);
target->SetLengthAndZeroTerminate(length);
size_t actual_length = simdutf::convert_utf16le_to_utf8(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly, this could be utf16_to_utf8? I would test this on a big endian system to be sure.

Copy link
Member Author

@mertcanaltin mertcanaltin Mar 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for review, I changed to convert_utf16_to_utf8 which uses platform-native endianness, and updated bench result

@mertcanaltin mertcanaltin force-pushed the mert/use-simdutf-for-two-byte-utf8 branch 4 times, most recently from b8cc5e8 to dcbf493 Compare March 14, 2026 17:12
@mertcanaltin mertcanaltin force-pushed the mert/use-simdutf-for-two-byte-utf8 branch from dcbf493 to fb16b2d Compare March 14, 2026 17:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

c++ Issues and PRs that require attention from people who are familiar with C++. needs-ci PRs that need a full CI run.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants