When programming, we often have to escape strings. A standard way to do so is to insert the backslash character (\) before certain characters, such as the double quote. For instance, the string
my title is "La vie"
becomes
my title is \"La vie\"
A simple routine in C++ to escape a string might look as follows:
    for (...) {
      if ((*in == '\\') || (*in == '"')) {
        *out++ = '\\';
      }
      *out++ = *in;
    }
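For reference, the scalar loop can be wrapped into a complete function. The following is a minimal sketch: the name `escape_scalar` and the `std::string` interface are illustrative assumptions, not part of the original code.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper (name and interface are illustrative): prefixes every
// backslash and double quote with a backslash, one character at a time.
std::string escape_scalar(std::string_view in) {
  std::string out;
  out.reserve(2 * in.size()); // worst case: every character gets escaped
  for (char c : in) {
    if (c == '\\' || c == '"') {
      out.push_back('\\');
    }
    out.push_back(c);
  }
  return out;
}
```

Reserving twice the input size up front avoids reallocations, since the output can never be more than twice as long as the input.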
Such a character-by-character approach is unlikely to deliver the best possible performance on modern hardware.
Recent Intel processors have fast instructions (AVX-512) that are well suited to such problems. I decided to sketch a solution using Intel intrinsic functions. The routine goes as follows:
- I use two constant registers containing 64 copies of the backslash character and 64 copies of the quote character.
- I start a loop by loading 32 bytes from the input.
- I expand these 32 bytes into a 64-byte register, interleaving zero bytes.
- I compare these bytes with the quote and backslash characters.
- From the resulting mask, I then construct (by shifting and blending) the escaped characters.
- I compress the result, removing the zero bytes that appear before the unescaped characters.
- I advance the output pointer by the number of bytes written and continue the loop.
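To make the mask logic in these steps concrete, here is a portable scalar model of a single loop iteration. It is only a sketch for reasoning about the algorithm, not the vectorized routine itself: the name `escape_block_model` and the array-based emulation of the 64-byte register are assumptions made for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar model of one iteration (hypothetical helper, for illustration only).
// Reads 32 bytes from `in`, writes the escaped bytes to `out` (room for 64
// bytes is required) and returns the number of bytes written.
std::size_t escape_block_model(const char *in, char *out) {
  // Expand: each input byte c becomes the pair (c, 0).
  std::array<char, 64> widened{};
  for (std::size_t i = 0; i < 32; i++) {
    widened[2 * i] = in[i];
  }
  // Compare: bit 2*i is set when input byte i is a backslash or a quote.
  std::uint64_t needs_escape = 0;
  for (std::size_t i = 0; i < 64; i++) {
    if (widened[i] == '\\' || widened[i] == '"') {
      needs_escape |= std::uint64_t(1) << i;
    }
  }
  // The odd positions are always kept: they will hold the characters.
  const std::uint64_t to_keep = needs_escape | 0xaaaaaaaaaaaaaaaaULL;
  // Shift each pair to (0, c), then blend a backslash into the even slot
  // of every pair whose character needs escaping.
  std::array<char, 64> escaped{};
  for (std::size_t i = 0; i < 32; i++) {
    escaped[2 * i + 1] = widened[2 * i];
    if ((needs_escape >> (2 * i)) & 1) {
      escaped[2 * i] = '\\';
    }
  }
  // Compress: write the kept bytes contiguously, dropping the zero bytes.
  std::size_t written = 0;
  for (std::size_t i = 0; i < 64; i++) {
    if ((to_keep >> i) & 1) {
      out[written++] = escaped[i];
    }
  }
  return written;
}
```

Each block produces between 32 and 64 output bytes, which is why the output pointer advances by the population count of the mask rather than by a fixed amount.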
The C++ code looks roughly like this:
    __m512i solidus = _mm512_set1_epi8('\\');
    __m512i quote = _mm512_set1_epi8('"');
    for (; in + 32 <= finalin; in += 32) {
      // load 32 bytes and widen each one to 16 bits: (c, 0) pairs
      __m256i input = _mm256_loadu_si256((const __m256i *)in);
      __m512i input1 = _mm512_cvtepu8_epi16(input);
      __mmask64 is_solidus = _mm512_cmpeq_epi8_mask(input1, solidus);
      __mmask64 is_quote = _mm512_cmpeq_epi8_mask(input1, quote);
      __mmask64 is_quote_or_solidus = _kor_mask64(is_solidus, is_quote);
      // always keep the odd bytes, which will hold the characters
      __mmask64 to_keep = _kor_mask64(is_quote_or_solidus, 0xaaaaaaaaaaaaaaaa);
      // move each character to the odd byte: (0, c) pairs
      __m512i shifted_input1 = _mm512_bslli_epi128(input1, 1);
      __m512i escaped =
          _mm512_mask_blend_epi8(is_quote_or_solidus, shifted_input1, solidus);
      // write only the kept bytes, then advance by how many were written
      _mm512_mask_compressstoreu_epi8(out, to_keep, escaped);
      out += _mm_popcnt_u64(_cvtmask64_u64(to_keep));
    }
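The loop only consumes full 32-byte blocks, so a complete function also has to deal with the remaining bytes. One possible way to do this is sketched below, under several assumptions that are not in the original code: the name `escape_with_tail`, a scalar fallback for the tail, a compile-time check of the `__AVX512BW__` and `__AVX512VBMI2__` macros, and an output buffer with room for `2 * len` bytes.

```cpp
#include <cstddef>
#if defined(__AVX512BW__) && defined(__AVX512VBMI2__)
#include <immintrin.h>
#endif

// Hypothetical wrapper (name and interface are illustrative).
// `out` must have room for 2 * len bytes; returns the number of bytes written.
std::size_t escape_with_tail(const char *in, std::size_t len, char *out) {
  const char *const finalin = in + len;
  char *const start = out;
#if defined(__AVX512BW__) && defined(__AVX512VBMI2__)
  // Vectorized main loop, compiled only when the target supports it.
  const __m512i solidus = _mm512_set1_epi8('\\');
  const __m512i quote = _mm512_set1_epi8('"');
  for (; finalin - in >= 32; in += 32) {
    __m256i input = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(in));
    __m512i input1 = _mm512_cvtepu8_epi16(input);
    __mmask64 is_solidus = _mm512_cmpeq_epi8_mask(input1, solidus);
    __mmask64 is_quote = _mm512_cmpeq_epi8_mask(input1, quote);
    __mmask64 is_quote_or_solidus = _kor_mask64(is_solidus, is_quote);
    __mmask64 to_keep = _kor_mask64(is_quote_or_solidus, 0xaaaaaaaaaaaaaaaa);
    __m512i shifted_input1 = _mm512_bslli_epi128(input1, 1);
    __m512i escaped =
        _mm512_mask_blend_epi8(is_quote_or_solidus, shifted_input1, solidus);
    _mm512_mask_compressstoreu_epi8(out, to_keep, escaped);
    out += _mm_popcnt_u64(_cvtmask64_u64(to_keep));
  }
#endif
  // Scalar tail: the last len % 32 bytes, or the whole input when the
  // AVX-512 path is not compiled in.
  for (; in < finalin; in++) {
    if (*in == '\\' || *in == '"') {
      *out++ = '\\';
    }
    *out++ = *in;
  }
  return static_cast<std::size_t>(out - start);
}
```

The guard here is a compile-time one only. A binary meant to run on machines without AVX-512 VBMI2 would need a runtime check of the processor features instead.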
The AVX-512 routine could be greatly improved. Nevertheless, it is a good first step. What are the results on an Intel Ice Lake processor using GCC 11 (Linux)? A simple benchmark indicates a 5x performance boost compared to a naive implementation:
| implementation | time per character |
|---|---|
| regular code | 0.6 ns/character |
| AVX-512 code | 0.1 ns/character |
It looks quite encouraging! My source code is available. It requires a recent x64 processor with AVX-512 VBMI2 support.
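The benchmark harness is not shown in this post. As a rough idea of how a figure in ns/character can be obtained, here is a minimal timing sketch based on `std::chrono`. The name `ns_per_character`, the best-of-N strategy and the callable interface (a function taking the input string and returning the output length) are all assumptions made for illustration, not the actual benchmark.

```cpp
#include <chrono>
#include <cstddef>
#include <limits>
#include <string>

// Hypothetical harness: calls `escape_fn(input)` `repeats` times and returns
// the best observed time, in nanoseconds per input character.
template <typename F>
double ns_per_character(F &&escape_fn, const std::string &input,
                        int repeats = 10) {
  volatile std::size_t sink = 0; // keeps the compiler from dropping the call
  double best_ns = std::numeric_limits<double>::infinity();
  for (int r = 0; r < repeats; r++) {
    const auto start = std::chrono::steady_clock::now();
    sink = escape_fn(input);
    const auto stop = std::chrono::steady_clock::now();
    const double ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    if (ns < best_ns) {
      best_ns = ns;
    }
  }
  (void)sink;
  return best_ns / static_cast<double>(input.size());
}
```

Taking the minimum over several runs reduces the influence of one-off disturbances such as cold caches. The input should be large enough that the clock resolution does not dominate the measurement.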