Is there a generic (non-)word regex that covers all mutations of characters on this globe? I am developing an application that should handle all languages. Technically, I want to split sentences into words. Splitting on non-word characters (`\W`) also splits on 'ä'. A static workaround is not an option, since explicitly covering all mutations in this world (éçḮñ and thousands more) is impossible.
-
So, it is JavaScript? Use XRegExp `[^\pL]` or `\PL`. – Wiktor Stribiżew Jan 19 '16 at 12:04
-
Why don't you split by `\s`? Could you provide input and desired output? – Thomas Ayoub Jan 19 '16 at 12:04
-
So you want to split `it's` into `it` and `s`? Wouldn't it make sense to split on whitespace and non-connecting punctuation? At any rate, you definitely need to tell us which regex engine you're using. – Tim Pietzcker Jan 19 '16 at 12:20
-
No, it's a C++/Qt application. I want to index words, so splitting by space is not optimal either, because of punctuation marks. But a static set of separators is indeed a better approach. This almost solves my problem, thank you, but it does not answer the SO question. – ManuelSchneid3r Jan 19 '16 at 12:31
-
Do you mean "umlaut" (which is specifically the mark in the German characters ä, ö, and ü), any accent (e.g. ̈ , ̂ , etc.), or any accented character (Ö, ê, ñ, etc.)? – Martin Bonner supports Monica Jan 19 '16 at 13:16
-
I mean any diacritic signs or, more generally, any mutations that do not fit in `[A-Za-z]` but are word characters (in a natural interpretation). – ManuelSchneid3r Jan 19 '16 at 13:29
-
There are languages where the characters look like they have diacritic marks but are in fact considered actual letters in the alphabet. I think in Swedish for example, the letters `å`, `ä` and `ö` are considered distinct letters, not letters with diacritic marks (I could be wrong there but I remember there is a language like that). – dreamlax Jan 19 '16 at 13:34
1 Answer
I can't give you something that will work on all languages because I don't know enough languages to judge whether there will be edge cases.
My suggestion:
- Split on whitespace (`\s+`).
- Trim punctuation characters from the start/end of each "word" you got in step 1 (replace `^\p{P}+|\p{P}+$` with nothing; the QRegularExpression docs say that it supports Unicode fully, so there's hope this will work).
Unless you care about preserving punctuation in examples like `This is Charles' car`, this should go a long way without removing punctuation within words like `it's` or `Marne-sur-Seine`.
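The two steps above can be sketched in a few lines. This is only an illustration under stated assumptions, not the Qt code the question is about: it is written in Python, and because Python's built-in `re` module has no `\p{P}` class, it checks Unicode general categories through `unicodedata` instead of using the regex from the answer. The helper names `is_punct` and `split_words` are made up for this example.

```python
import unicodedata


def is_punct(ch):
    # General categories starting with "P" (Po, Pd, Ps, ...) are the
    # punctuation classes that \p{P} stands for in Unicode-aware engines.
    return unicodedata.category(ch).startswith("P")


def split_words(text):
    words = []
    # Step 1: str.split() without arguments splits on runs of whitespace
    # (used here as a stand-in for splitting on \s+).
    for token in text.split():
        start, end = 0, len(token)
        # Step 2: trim punctuation from both ends, leaving inner
        # punctuation such as the apostrophe in "it's" untouched.
        while start < end and is_punct(token[start]):
            start += 1
        while end > start and is_punct(token[end - 1]):
            end -= 1
        if start < end:  # skip tokens that were punctuation only
            words.append(token[start:end])
    return words


if __name__ == "__main__":
    print(split_words("Hällo, it's Marne-sur-Seine!"))
    print(split_words("This is Charles' car"))
```

As the answer warns, the trailing apostrophe in `Charles'` is trimmed, while the punctuation inside `it's` and `Marne-sur-Seine` survives. Symbols that are not in a punctuation category (currency signs, for instance) are left in place, which matches what `\p{P}` would do.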

Tim Pietzcker