
I've encountered this strange phenomenon several times now. If I use an ifstream to feed a program with the content of a file and apply a regular expression to the incoming words, the German letters ä, ö and ü cause me some difficulties. If any one of these appears at the beginning of a word, the regular expression fails to recognize the word, but not if the letter appears within the word. So these lines

string word = "über";
regex check {R"(\b)" + word + R"(\b)", regex_constants::icase};
string search = "Es war genau über ihm.";

won't work because the regex fails to find über in the string search. However,

string word = "für";
regex check {R"(\b)" + word + R"(\b)", regex_constants::icase};
string search = "Es war für ihn.";

will work because the ü appears within the word. Why is that, and how can I fix it? I've thought about replacing every ü with ue, every ä with ae and every ö with oe and undoing the replacement later, but is there another possibility? I'm working with Visual Studio 2015.

AlexM

1 Answer


Use regex check {"(^|[\\x60\\x00-\\x2f\\x3a-\\x40\\x5b-\\x5e\\x7b-\\x7e])über($|[\\x60\\x00-\\x2f\\x3a-\\x40\\x5b-\\x5e\\x7b-\\x7e])", regex_constants::icase}; instead.

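The default grammar of C++ regex is ECMAScript, so it behaves much like JavaScript regular expressions. \b doesn't support Unicode: it only treats A-Za-z0-9_ as word characters, so there is no word boundary between a space and ü.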

And from microsoft.com:

Word Boundary

A word boundary occurs in the following situations:

  • The current character is at the beginning of the target sequence and is one of the word characters A-Za-z0-9_.

  • The current character position is past the end of the target sequence and the last character in the target sequence is one of the word characters.

  • The current character is one of the word characters and the preceding character is not.

  • The current character is not one of the word characters and the preceding character is.
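Given that definition, one way to keep the search word in a variable is to replace each \b with an explicit "start/end of string or ASCII non-word character" alternative. The following is only a minimal sketch: `contains_word` is a hypothetical helper name, and it assumes that the word contains no regex metacharacters and that the pattern and the searched text use the same byte encoding. Every non-ASCII byte counts as part of a word here.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of the original post): reports whether `word`
// occurs in `text` as a whole word. A word ends at the start or end of the
// string, or at an ASCII character other than A-Za-z0-9_.
// Assumption: `word` contains no regex metacharacters.
bool contains_word(const std::string& text, const std::string& word) {
    // Every ASCII character except A-Za-z0-9_ (same class as in the answer).
    const std::string non_word =
        R"([\x60\x00-\x2f\x3a-\x40\x5b-\x5e\x7b-\x7e])";

    // (?:^|D)word(?:$|D) stands in for \bword\b.
    const std::string pattern =
        "(?:^|" + non_word + ")" + word + "(?:$|" + non_word + ")";

    const std::regex check{pattern, std::regex_constants::icase};
    return std::regex_search(text, check);
}
```

With this, `contains_word("Es war genau über ihm.", "über")` should report a match, while searching for über in "Er stand drüber." should not. Note that the delimiters are consumed as part of the match, and that icase may only fold ASCII letters, so Über and über are not necessarily treated as equal.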

cshu
  • What is the meaning of these numbers? Do they stand for characters? – AlexM Feb 01 '17 at 13:31
  • @AlexM `[\\x60\\x00-\\x2f\\x3a-\\x40\\x5b-\\x5e\\x7b-\\x7e]` means any ascii character except `A-Za-z0-9_`. It matches all the common punctuation marks. – cshu Feb 01 '17 at 13:46
  • @AlexM note this regex regards all non-ascii characters in the same way as `A-Za-z0-9_`. It will treat special unicode punctuation in the same way as other letters. e.g. `。` and `ü` are treated in the same way. – cshu Feb 01 '17 at 13:51
  • I see. Is it possible to apply this thing to a regex and simultaneously use the name of the string in the regex instead of writing its content in the regex? Like I did above. It would be really nice if I could stick to that but my first attempt failed. – AlexM Feb 01 '17 at 13:58
  • @AlexM yes it should work. You mean concatenate and pass the `string` to the constructor of `regex`. Something like `regex check {prefix + word + suffix, regex_constants::icase};` – cshu Feb 01 '17 at 14:03
  • I've tried the following: Say we have an ifstream and its current content consists of three words, and I want to check whether all three words appear in the search string. So I split the current line given by the ifstream into three strings (named aw, bw and cw) via an istringstream, and then I pass these strings to my regex, which looks like this: regex att{ R"(\b)" + aw + R"(\b(.*?)\b)" + bw + R"(\b(.*?)\b)" + cw + R"(\b)", regex_constants::icase }; What I did then was to replace the stuff between the three strings with the expression from above, but it doesn't behave as expected. – AlexM Feb 01 '17 at 14:11
  • @AlexM it probably does not do what you want. Because it doesn't match two words separated by a single punctuation mark. I guess you need something more specifically crafted. – cshu Feb 01 '17 at 14:33
  • This can't be the reason because (due to the program I'm writing) punctuation marks can't occur in the search string. It's really just about non-ASCII characters. – AlexM Feb 01 '17 at 14:48
  • @AlexM I mean if the regex includes something like `foo([\\x60\\x00-\\x2f\\x3a-\\x40\\x5b-\\x5e\\x7b-\\x7e](.*?)[\\x60\\x00-\\x2f\\x3a-\\x40\\x5b-\\x5e\\x7b-\\x7e])bar`, it fails to match `foo bar`, because it needs two separators between the words. To match consecutive words, you probably need something like `foo([\\x60\\x00-\\x2f\\x3a-\\x40\\x5b-\\x5e\\x7b-\\x7e]+)bar`. – cshu Feb 01 '17 at 14:58
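For the three-word case discussed in the comments, a possible sketch is to put one separator between two words, optionally followed by arbitrary text that ends in another separator. That way both adjacent words and words with other text in between can match. `contains_in_order` is a hypothetical name, and the same assumptions apply as before: the words contain no regex metacharacters, there is at least one word, and the text and the words share one byte encoding.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): reports whether all `words`
// occur in `text` as whole words, in the given order.
// Assumption: the words contain no regex metacharacters.
bool contains_in_order(const std::string& text,
                       const std::vector<std::string>& words) {
    // Every ASCII character except A-Za-z0-9_ (same class as in the answer).
    const std::string sep =
        R"([\x60\x00-\x2f\x3a-\x40\x5b-\x5e\x7b-\x7e])";

    // One separator, optionally followed by any text ending in a separator.
    const std::string gap = sep + "(?:.*?" + sep + ")?";

    std::string pattern = "(?:^|" + sep + ")";
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (i > 0) {
            pattern += gap;
        }
        pattern += words[i];
    }
    pattern += "(?:$|" + sep + ")";

    const std::regex check{pattern, std::regex_constants::icase};
    return std::regex_search(text, check);
}
```

For example, `contains_in_order("es war genau über ihm", {"war", "über", "ihm"})` should succeed, while the same call with the words in a different order should not.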