
I would like to find a substring in a UTF-8 string, case-insensitively. From what I have read, the usual way to do this is to case-fold both strings to bring them to a canonical form.

However, full case folding can change the length of a string, and I don't want that because I need the exact offset of the match in the original string. So it seems I should use simple case folding instead. The case-insensitive comparison won't be fully accurate, but it will be a best effort.

However, I cannot find functions in the ICU API that apply simple case folding to whole strings. I can find simple case folding only in the single-character functions (u_foldCase() in uchar.h). Is there an option to use simple case folding for whole strings?
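To make the goal concrete, here is a minimal sketch of the kind of thing I am after, built by hand rather than taken from ICU: decode the UTF-8 into code points, fold each one individually, and record the byte offset where each code point started, so a match can be mapped back to the original string. All names (`Folded`, `foldWithOffsets`, `findFolded`, `asciiFold`) are hypothetical. `asciiFold` is an ASCII-only stand-in so the example builds without ICU; with ICU, a lambda calling `u_foldCase(c, U_FOLD_CASE_DEFAULT)` could presumably be passed instead. The decoder assumes well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a per-code-point simple fold (ASCII letters only).
// With ICU, u_foldCase(c, U_FOLD_CASE_DEFAULT) from unicode/uchar.h could be used instead.
inline char32_t asciiFold(char32_t c) {
    return (c >= U'A' && c <= U'Z') ? static_cast<char32_t>(c + 32) : c;
}

struct Folded {
    std::u32string text;               // one folded code point per input code point
    std::vector<std::size_t> offsets;  // offsets[i] = byte offset of code point i; last entry = end
};

// Decode UTF-8 (assumed well-formed), fold each code point, remember byte offsets.
template <typename Fold>
Folded foldWithOffsets(const std::string& s, Fold fold) {
    Folded out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        if (i + len > s.size()) break;  // truncated sequence: stop here
        // Payload bits of the lead byte, then 6 bits from each continuation byte.
        char32_t cp = (len == 1) ? b : (b & (0xFF >> (len + 1)));
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.offsets.push_back(i);
        out.text.push_back(static_cast<char32_t>(fold(cp)));
        i += len;
    }
    out.offsets.push_back(i);  // sentinel: byte offset just past the last code point
    return out;
}

// Byte offset of the first case-insensitive match in `haystack`, or std::string::npos.
template <typename Fold>
std::size_t findFolded(const std::string& haystack, const std::string& needle, Fold fold) {
    Folded h = foldWithOffsets(haystack, fold);
    Folded n = foldWithOffsets(needle, fold);
    std::size_t pos = h.text.find(n.text);
    return pos == std::u32string::npos ? std::string::npos : h.offsets[pos];
}
```

The offset table matters even with simple folding, because simple folding keeps the number of code points but not necessarily the number of UTF-8 bytes (U+212A KELVIN SIGN takes three bytes and folds to the one-byte `k`). For a multi-pattern search, the same `offsets` vector could map matches found over `text` back to the original bytes.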

elad-ep
  • Does this answer your question? [Case insensitive std::string.find()](https://stackoverflow.com/questions/3152241/case-insensitive-stdstring-find) – pptaszni Sep 07 '20 at 12:48
  • No, I'm actually not dealing with a single substring search, but with multiple ones using string-matching algorithms. That is less relevant to the question, so I didn't mention it. – elad-ep Sep 07 '20 at 16:18
  • OK, then I don't understand the question. Can you show sample input and expected output? Should it find multiple occurrences of one substring, or single occurrences of many different substrings? Also, the ICU API you mentioned (which has moved to [GitHub](https://unicode-org.github.io/icu/)) is irrelevant here, as it works with Unicode characters. For UTF you don't need any additional API. – pptaszni Sep 08 '20 at 07:05
  • I'm finding a dictionary (multiple strings) inside a text using Boyer-Moore and similar algorithms. Why don't I need any API? I have UTF-8 words and UTF-8 text, and I want to search for the words in the text, case-insensitively. So I need to perform case mapping on both the text and the words. – elad-ep Sep 08 '20 at 07:41
  • Because the mapping is trivial. Only characters with values in the range 65-90 (uppercase) should be considered equal to the characters in the range 97-122 (lowercase). And that's it: mapping outside of that range is the identity, so you can just have a static 128-element array to do the character comparison, or simply write a `bool compare(char a, char b)` function with a condition inside, or use [std::tolower](https://en.cppreference.com/w/cpp/string/byte/tolower). That's why it is unclear what you are asking for, especially without any actual code provided. – pptaszni Sep 08 '20 at 08:17
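
For reference, the byte-wise comparison suggested in the last comment could look roughly like the sketch below. The names `asciiLower`, `equalAsciiNoCase`, and `findAsciiNoCase` are hypothetical. It folds only the ASCII letters A-Z and leaves every other byte alone, so byte offsets in the UTF-8 text are preserved, but non-ASCII letters such as É and é are never treated as equal.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Lowercase ASCII letters only. Every other byte, including all bytes of
// multi-byte UTF-8 sequences (which are >= 0x80), is returned unchanged.
inline unsigned char asciiLower(unsigned char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<unsigned char>(c + ('a' - 'A')) : c;
}

// The `bool compare(char a, char b)` idea from the comment, ASCII-only.
inline bool equalAsciiNoCase(char a, char b) {
    return asciiLower(static_cast<unsigned char>(a)) ==
           asciiLower(static_cast<unsigned char>(b));
}

// Byte offset of the first match of `needle` in `haystack`, or std::string::npos.
inline std::size_t findAsciiNoCase(const std::string& haystack, const std::string& needle) {
    if (needle.empty()) return 0;
    auto it = std::search(haystack.begin(), haystack.end(),
                          needle.begin(), needle.end(), equalAsciiNoCase);
    return it == haystack.end() ? std::string::npos
                                : static_cast<std::size_t>(it - haystack.begin());
}
```

This is enough when the dictionary and the text differ in case only for ASCII letters. It does not cover the general Unicode case the question asks about.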
