Generic regex umlaut solution?

Question

Is there a generic (non-)word regex that covers all mutations of characters on this globe? I am developing an application that should handle all languages. Technically, I want to split sentences into words. Splitting on non-word characters (`\W`) splits on 'ä' too. A static workaround is not an option, since explicitly covering all mutations in the world (éçḮñ and thousands more) is impossible.

Why don't you split by `\s` ? Could you provide input & desired output ? — Thomas Ayoub, Jan 19 '16 at 12:04
So you want to split `it's` into `it` and `s`? Wouldn't it make sense to split on whitespace and non-connecting punctuation? At any rate, you definitely need to tell us which regex engine you're using. — Tim Pietzcker, Jan 19 '16 at 12:20
No, it's a C++/Qt application. I want to index words, so splitting on spaces is not optimal either, because of punctuation marks. But a static set of separators is indeed a better approach. This almost solves my problem, thank you, but not the SO question. — ManuelSchneid3r, Jan 19 '16 at 12:31
Do you mean "umlaut" (which is specifically the mark in the German characters ä, ö, and ü) or any accent (i.e. ̈, ̂, etc.) or any accented character (Ö, ê, ñ, etc.)? — Martin Bonner supports Monica, Jan 19 '16 at 13:16
I mean any diacritic signs or, more generally, any mutations that do not fit in `[A-Za-z]` but are word characters (in a natural interpretation). — ManuelSchneid3r, Jan 19 '16 at 13:29
There are languages where the characters look like they have diacritic marks but are in fact considered actual letters in the alphabet. I think in Swedish for example, the letters `å`, `ä` and `ö` are considered distinct letters, not letters with diacritic marks (I could be wrong there but I remember there is a language like that). — dreamlax, Jan 19 '16 at 13:34

score 2 · Answer 1 · answered Jan 19 '16 at 12:50

I can't give you something that will work on all languages because I don't know enough languages to judge whether there will be edge cases.

My suggestion:

1. Split on whitespace (`\s+`).
2. Trim punctuation characters from the start and end of each "word" you got in step 1 (replace `^\p{P}+|\p{P}+$` with nothing; the QRegularExpression docs say that it supports Unicode fully, so there's hope this will work).

Unless you care about preserving punctuation in examples like `This is Charles' car`, this should go a long way without removing punctuation within words like `it's` or `Marne-sur-Seine`.
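To make the two steps concrete, here is a rough sketch of the logic. It is written in Python with only the standard library, not in Qt, purely to illustrate the idea. As an assumption on my part, it replaces the `\p{P}` regex with a check on the Unicode general category (any category starting with `P`), which should cover the same characters. The names `is_punct` and `split_words` are made up for this example.

```python
import unicodedata


def is_punct(ch):
    # Unicode general categories starting with "P" (Pc, Pd, Ps, Pe, Pi, Pf, Po)
    # are used here as a stand-in for what \p{P} matches.
    return unicodedata.category(ch).startswith("P")


def split_words(text):
    words = []
    # Step 1: str.split() with no argument splits on runs of whitespace,
    # roughly like splitting on \s+.
    for token in text.split():
        # Step 2: trim punctuation from both ends only, leaving
        # punctuation inside the token untouched.
        start, end = 0, len(token)
        while start < end and is_punct(token[start]):
            start += 1
        while end > start and is_punct(token[end - 1]):
            end -= 1
        if start < end:  # skip tokens that were punctuation only
            words.append(token[start:end])
    return words


print(split_words("«Äpfel», it's Marne-sur-Seine! Charles' car…"))
```

Note that the trailing apostrophe in `Charles'` is trimmed, which is the caveat mentioned above. Currency and math symbols such as `$` or `+` belong to the symbol categories (`\p{S}`), not to `\p{P}`, so a token like `$5` is left as it is.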
