Can regular expressions work with different languages?

Question

English, of course, is a no-brainer for regex because that's what it was originally developed in/for:

Can regular expressions understand this character set?

French gets into some accented characters which I'm unsure how to match against - i.e. are è and e both considered word characters by regex?

Les expressions régulières peuvent comprendre ce jeu de caractères?

Japanese doesn't contain what I know as regex word characters to match against.

正規表現は、この文字を理解でき、設定？

I think this may also depend heavily on the platform on which the regex engine is running, did you have one in mind? — Lazarus, Mar 03 '10 at 13:56
"Regex", or "regular expression", is a concept defined for any collection of symbols you might want to call an alphabet. In practice, there are many regular expression engines (all of which I've seen add other capabilities as well), some of which presumably handle Unicode of some flavor well and some of which probably don't. In short, this is a platform-dependent question, and to get a useful response you will need to tell us which regex engine you're talking about. — David Thornley, Mar 03 '10 at 15:20

score 9 · Accepted Answer · edited Mar 03 '10 at 15:11

Short answer: yes.

More specifically, it depends on your regex engine supporting Unicode matches (as described here).

Such matches can complicate your regular expressions enormously, so I recommend reading this Unicode regex tutorial. Also note that Unicode implementations themselves can be quite a mess, so you might benefit from reading Joel Spolsky's article about the inner workings of character sets.

edited Mar 03 '10 at 15:11 by Joachim Sauer
answered Mar 03 '10 at 13:56 by Lars Tackmann

Note that Unicode is not the mess. It's all the attempts that came before it that make the entire matter messy. – Joachim Sauer Mar 03 '10 at 14:03
By definition in that article, Unicode can't be a mess: implementations can be. – Tom Mar 03 '10 at 14:05

div-ane · Answer 2 · 2021-10-15T11:33:50.017

"[\p{L}]" In engines that support Unicode properties, this character class matches any single letter from any language, upper or lower case. Letters such as a-z, A-Z, ä, ß, è and 正の文字を理解 are accepted, but punctuation such as , . ? > : is not.

The brackets [] mean that this expression is a set (a character class).
To accept any number of letters from this set, put an asterisk * after the brackets, like this: "[\p{L}]*"
Whitespace matters too, since a match can fail because of a space in the input. To allow spaces, use "[\p{L} ]*" (notice the space inside the brackets).
To include numbers as well, use "[\p{L}\p{N} ]*". \p{N} matches any kind of numeric character in any script.
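
For engines without \p{...} support, the same idea can be approximated by checking each character's Unicode general category. A minimal sketch in Python, whose built-in re module does not accept \p{L}; the function name is hypothetical, and the check only mirrors "[\p{L} ]*" and "[\p{L}\p{N} ]*" rather than reproducing any particular engine:

```python
import unicodedata


def only_letters_and_spaces(text: str, allow_numbers: bool = False) -> bool:
    # Rough stand-in for fully matching "[\p{L} ]*", or "[\p{L}\p{N} ]*"
    # when allow_numbers is True. The function name is hypothetical.
    for ch in text:
        if ch == " ":
            continue
        # First letter of the general category: "L" = letter, "N" = number.
        major = unicodedata.category(ch)[0]
        if major == "L" or (allow_numbers and major == "N"):
            continue
        return False
    return True


print(only_letters_and_spaces("ä ß è 正の文字を理解"))
print(only_letters_and_spaces("a, b?"))
print(only_letters_and_spaces("abc 123", allow_numbers=True))
```

Like the * quantifier, the function accepts an empty string. Accented letters written as a base letter plus a combining mark would be rejected, because combining marks belong to category M rather than L.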

I also found this very helpful for working with different languages: https://medium.com/@h2s1880/how-to-use-regular-expressions-to-distinguish-national-languages-in-swift-c19d6d8d0a97 — div-ane, Oct 10 '20 at 13:08

score 1 · Answer 3 · answered Mar 03 '10 at 13:54

As far as I know, there isn't a ready-made pattern along the lines of [a-zA-Z] that also matches "è", but you can always list such characters explicitly, e.g. [a-zA-Zè正]

Obviously that can make your regexp immense, but you can keep it manageable by storing the character lists in variables and building the expression from those variables.
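
A minimal sketch of that approach in Python. The variable names and the choice of extra characters are assumptions made for illustration only:

```python
import re

# Hypothetical building blocks; extend EXTRA_LETTERS as needed.
ASCII_LETTERS = "a-zA-Z"
EXTRA_LETTERS = "è正"

# re.escape guards against extras that are special inside a character class.
LETTER_CLASS = "[" + ASCII_LETTERS + re.escape(EXTRA_LETTERS) + "]"
WORD = re.compile(LETTER_CLASS + "+")

print(WORD.fullmatch("très") is not None)
print(WORD.fullmatch("naïve") is not None)
```

Only the characters listed explicitly are accepted, so "naïve" fails until ï is added to EXTRA_LETTERS. That is the main drawback of enumerating characters by hand.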

score 1 · Answer 4 · answered Mar 03 '10 at 14:01

Generally speaking, regex is more for grokking machine-readable text than for human-readable text. This is in many ways a more general version of the answer to the whole "parsing XML with regex" question: regex is by its very nature incapable of properly parsing human language, because the language is more complex than the tool you are using to parse it.

If you want to break down human language (English included), you would want to use a language analysis tool or even an AI, not mere regular expressions.

score 1 · Answer 5 · answered Mar 03 '10 at 15:05

/[\p{Latin}]/ should, for example, match any character in the Latin script. You can get the full explanation and reference here.

answered Mar 03 '10 at 15:05 by casraf

That's a useful-looking site, but it concentrates on Perl and similar regex engines. It isn't universal. – David Thornley Mar 03 '10 at 15:25
hmm yeah, I'm not sure what engine the asker uses, but maybe it's useful? Perl RegEx engine is used widely – casraf Mar 03 '10 at 15:40

score 0 · Answer 6 · answered Mar 03 '10 at 13:55

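It is not about the regular expression itself but about the framework that executes it. Java and .NET, I think, are very good at handling Unicode, so "è and e both considered word characters by regex" can be true. Note that this depends on settings: .NET treats è as a word character by default, while Java's \w is ASCII-only unless the UNICODE_CHARACTER_CLASS flag is set.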
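
As one concrete illustration of how the engine and its flags decide this, here is a small sketch using Python's built-in re module, where \w is Unicode-aware by default for text patterns and the re.ASCII flag restricts it to ASCII. The helper name is hypothetical:

```python
import re

# For str patterns, \w is Unicode-aware by default in Python 3.
unicode_word = re.compile(r"\w+")
# re.ASCII restricts \w to [a-zA-Z0-9_].
ascii_word = re.compile(r"\w+", re.ASCII)


def is_word(pattern: re.Pattern, text: str) -> bool:
    # True only when the whole string consists of word characters.
    return pattern.fullmatch(text) is not None


for sample in ("e", "è", "régulières", "正規表現"):
    print(sample, is_word(unicode_word, sample), is_word(ascii_word, sample))
```

With the default settings all four samples count as words; with re.ASCII only the plain "e" does. Other engines make a similar choice with their own defaults and switches.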

answered Mar 03 '10 at 13:55 by Andrey

score 0 · Answer 7 · answered Mar 03 '10 at 13:58

It depends on the implementation and the character set. In general the answer is "Yes," but it may require additional setup on your part.

In Perl, for example, the meaning of things like \w is altered by the chosen locale (use locale).

answered Mar 03 '10 at 13:58 by sorpigal

score 0 · Answer 8 · edited May 23 '17 at 12:13

This SO thread might help. It includes the Unicode character classes you can use in a regex (e.g., the category Ll, written \p{Ll} in engines that support it, is all lowercase letters, regardless of language).

edited May 23 '17 at 12:13 by Community
answered Mar 03 '10 at 14:05 by Tom

Use in a regex in what engine? Perl? Boost? Java? – David Thornley Mar 03 '10 at 15:26
6.2L V8. What other kind is there? – Tom Mar 03 '10 at 16:06
