
I am trying to use SSE to accelerate the following task.

At a high level:

string a = "^&a&*";
string b = "abcdef";
bool c = a_contain_any_alphabet_in_b(a, b);
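
For reference, a scalar baseline for the check above could look like the sketch below. The name `contains_any_scalar` is only illustrative (it stands in for `a_contain_any_alphabet_in_b`); it relies on the standard `std::string::find_first_of`, which is handy for validating any SIMD version against.

```cpp
#include <string>

// Hypothetical helper: true if any character of `a` also appears in `b`.
bool contains_any_scalar(const std::string& a, const std::string& b) {
    // find_first_of returns the index of the first character of `a`
    // that matches any character of `b`, or npos if there is none.
    return a.find_first_of(b) != std::string::npos;
}
```

With the strings from the example, `contains_any_scalar("^&a&*", "abcdef")` is true because of the `a`.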

In more detail, using SSE (pseudocode):

a = _mm_set_epi8('^', .....);
b = _mm_set_epi8('a', .....);
mask = _mm_cmpestrm (a, la, b, lb, imm8); // imm8 selects _SIDD_CMP_EQUAL_ANY

... and then extract the mask.

My problem: what if b holds more than 128 bits? For example, I want to check whether string a contains any letter (a~z, A~Z) listed in b, but the set of letters takes 8*52 = 416 bits, which is more than one 128-bit register can hold.

The naive approach I came up with is to split b into several __m128i values:

mask1 = _mm_cmpestrm (a, la, b1, lb1, imm8);
mask2 = _mm_cmpestrm (a, la, b2, lb2, imm8);
...
and then combine all the masks (e.g. OR them together).
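
A minimal sketch of this chunked approach, assuming an x86 CPU with SSE4.2. The names `load_partial`, `contains_any_chunked` and `TARGET_SSE42` are made up for illustration. Instead of extracting and combining masks it uses `_mm_cmpestrc`, which returns non-zero as soon as the comparison result has any bit set, so an early exit replaces the OR over all masks.

```cpp
#include <nmmintrin.h>  // SSE4.2 string intrinsics
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>

// Assumption: GCC/Clang need SSE4.2 enabled per function unless -msse4.2 is passed.
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_SSE42 __attribute__((target("sse4.2")))
#else
#define TARGET_SSE42
#endif

// Hypothetical helper: copy n (<= 16) bytes into a zero-padded register.
static inline __m128i load_partial(const char* p, std::size_t n) {
    alignas(16) char buf[16] = {0};
    std::memcpy(buf, p, n);
    return _mm_load_si128(reinterpret_cast<const __m128i*>(buf));
}

// True if any character of `text` occurs in `set`; both may exceed 16 bytes.
TARGET_SSE42
bool contains_any_chunked(const std::string& text, const std::string& set) {
    for (std::size_t i = 0; i < text.size(); i += 16) {
        const int lt = static_cast<int>(std::min<std::size_t>(16, text.size() - i));
        const __m128i t = load_partial(text.data() + i, lt);
        for (std::size_t j = 0; j < set.size(); j += 16) {
            const int ls = static_cast<int>(std::min<std::size_t>(16, set.size() - j));
            const __m128i s = load_partial(set.data() + j, ls);
            // EQUAL_ANY: each valid byte of t is compared against every valid byte of s.
            if (_mm_cmpestrc(s, ls, t, lt, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY))
                return true;
        }
    }
    return false;
}
```

For a 52-letter set this costs four compares per 16 bytes of input, which is why a range check tends to be attractive when the set is made of contiguous ranges.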

Is there a smarter way to do this?

Steven
  • Do you always want to check for `a~zA~Z` or does this need to be customizable? – chtz Sep 17 '20 at 14:55
  • A contiguous range can be checked efficiently using a single sub and compare range-check trick. See my SIMD answer on [Convert a String In C++ To Upper Case](https://stackoverflow.com/a/37151084). My answer on [What is the idea behind ^= 32, that converts lowercase letters to upper and vice versa?](https://stackoverflow.com/a/54585515) shows a scalar version of detecting an alphabetic character. Just remove the `y |= 0x20` if you only care about one case, or keep it if you want to detect any alphabetic. – Peter Cordes Sep 17 '20 at 15:32
  • @chtz Actually, I also want to add a comma to the set. – Steven Sep 17 '20 at 22:31
  • @Steven Please [edit](https://stackoverflow.com/posts/63940531/edit) your question to describe the actual problem you want to solve. It's still not clear to me if you always want to check for the same set. – chtz Sep 18 '20 at 09:07
  • Yes, I always want to check for the same set. The set contains `a~zA~Z ` and a comma. – Steven Sep 18 '20 at 15:01
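
The range-check idea from the comments, applied to the fixed set described above, might be sketched as follows. This is only an illustration under assumptions: the function name `contains_letter_or_comma` is hypothetical, the set is taken to be ASCII letters plus the comma (a space is not included, since the comments leave that unclear), and it uses only SSE2. SSE2 has no unsigned byte compare, so the "less than 26" test is done with `_mm_min_epu8` followed by an equality compare.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstddef>
#include <string>

// True if `text` contains an ASCII letter (a-z, A-Z) or a comma.
bool contains_letter_or_comma(const std::string& text) {
    const __m128i case_bit = _mm_set1_epi8(0x20);
    const __m128i lower_a  = _mm_set1_epi8('a');
    const __m128i max_off  = _mm_set1_epi8(25);
    const __m128i comma    = _mm_set1_epi8(',');

    std::size_t i = 0;
    for (; i + 16 <= text.size(); i += 16) {
        const __m128i v = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(text.data() + i));
        // Fold upper case onto lower case, then measure the distance from 'a'.
        const __m128i off = _mm_sub_epi8(_mm_or_si128(v, case_bit), lower_a);
        // off <= 25 (unsigned) exactly when min(off, 25) == off.
        const __m128i is_alpha = _mm_cmpeq_epi8(_mm_min_epu8(off, max_off), off);
        const __m128i is_comma = _mm_cmpeq_epi8(v, comma);
        if (_mm_movemask_epi8(_mm_or_si128(is_alpha, is_comma)) != 0)
            return true;
    }
    // Scalar tail for the last (size % 16) bytes, using the same trick.
    for (; i < text.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(text[i]);
        if (static_cast<unsigned char>((c | 0x20) - 'a') < 26 || c == ',')
            return true;
    }
    return false;
}
```

This needs a handful of cheap instructions per 16 input bytes regardless of how many letters are in the set, but it only works because the set is two contiguous ranges plus one extra character.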

0 Answers