How to preg_match_all a set of words in any possible language?

Question

I have a website that people enter lists of words into.

These lists of words could be written in any language in the world.

How can I extract these lists of words from their input data if I do not know what language they are entering?

Is there some kind of match-all international alphabet symbol I am missing, or do I have to manually write up a set of brackets that will match every possible international letter?

Is this what I am looking for and just don't know it yet?

This is virtually impossible for languages that do not use clear word separators, like Chinese and Japanese: 言葉はどこからどこまででしょうね〜？ For these you *need* to know what language you're dealing with and use dictionary lookups to *guess* at the number of entered words. — deceze, Sep 05 '11 at 05:58
You are probably right. Looks like this is a whole can of worms larger than I ordered :/ — darkAsPitch, Sep 22 '11 at 10:47

score 3 · Kobi · Accepted Answer · 2011-09-05T04:54:09.710

You can use Unicode character properties, for example:

preg_match_all('#[\p{L}\p{Pc}]+#u', $str, $matches);

[\p{L}\p{Pc}]+ matches runs of letters and connector punctuation (such as the underscore). If you only want letters, you can shorten that to \pL+.
Either way, you'd want to define "word" better. It is probably more than a sequence of some letters...
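
As a minimal sketch of the idea above, the pattern can be wrapped in a small helper. The function name extract_words is hypothetical, and the addition of \p{M} (combining marks, so that decomposed accents stay attached to their base letter) is an assumption on top of the original pattern, not part of it. It also assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper: extract runs of letters, combining marks and
// connector punctuation from a UTF-8 string.
function extract_words(string $input): array
{
    // \p{L}  = any Unicode letter
    // \p{M}  = combining marks (assumption: added so decomposed accents stay in the word)
    // \p{Pc} = connector punctuation, e.g. "_"
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    if (preg_match_all('/[\p{L}\p{M}\p{Pc}]+/u', $input, $matches) === false) {
        return []; // preg_match_all() returns false on error, e.g. invalid UTF-8
    }
    return $matches[0];
}

print_r(extract_words("naïve café, привет мир; snake_case"));
```

This only finds runs of letter-like characters; whether each run is really a "word" still depends on the script and on how you define the term.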

answered Sep 05 '11 at 04:41 · edited Sep 05 '11 at 04:54 · Kobi

I couldn't find anything specific to PHP or PCRE, but here's a good read on the subject (Java centric): http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions – Kobi Sep 05 '11 at 05:08

This does nothing at all for scripts which don't have these features. See deceze's comment. – tripleee Sep 05 '11 at 07:08

@tripleee - "does nothing at all" is simply wrong. `\pL` should match Unicode letters in every language: http://ideone.com/8sbr9 , and it does match *letters* in Japanese, and not other symbols. Splitting the words correctly is *a whole other subject*, and isn't a simple task even with ASCII English letters, as it depends on context (for example, `lions'`, `Mr.`, etc.). – Kobi Sep 05 '11 at 07:24
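
To illustrate both points from this exchange, here is a small sketch that runs `\pL+` over the Japanese sentence quoted in the question's comments. The variable names are only for illustration. The letters are matched and the trailing punctuation is skipped, but because the sentence has no separators between words, it should come back as a single run rather than as individual words.

```php
<?php
// The sample sentence from the comment on the question.
$sentence = '言葉はどこからどこまででしょうね〜？';

// \pL+ matches maximal runs of Unicode letters; /u enables UTF-8 mode.
preg_match_all('/\pL+/u', $sentence, $matches);

// $runs holds the letter runs; the wave dash and question mark are not letters.
$runs = $matches[0];
echo count($runs), "\n";
print_r($runs);
```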

score 2 · Answer 2 · answered Sep 05 '11 at 07:14

My recommendation is to define your own input convention: force them to input one word at a time, or one word per line in a textbox. Otherwise, you will need a segmentation algorithm for each script (granted, it will be something trivial like "split on characters which have the Unicode word separator property" for the vast majority of scripts, but the remaining special cases are basically still open AI research topics).
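
A rough sketch of the one-word-per-line convention, assuming the submission arrives as a single UTF-8 string from a textarea and that PHP 7.4 or later is available (for arrow functions). The helper name words_from_lines is hypothetical.

```php
<?php
// Hypothetical helper: treat each non-empty line of the input as one entry.
function words_from_lines(string $input): array
{
    // \R matches any line break sequence (\n, \r\n, \r, ...).
    $lines = preg_split('/\R/u', $input);
    if ($lines === false) {
        return []; // preg_split() returns false on error, e.g. invalid UTF-8
    }

    // Strip leading/trailing whitespace, including Unicode separators
    // such as the ideographic space (category Z).
    $trimmed = array_map(
        fn(string $line): string => preg_replace('/^[\s\p{Z}]+|[\s\p{Z}]+$/u', '', $line) ?? '',
        $lines
    );

    // Drop empty lines and reindex the result.
    return array_values(array_filter($trimmed, fn(string $w): bool => $w !== ''));
}

print_r(words_from_lines("言葉\r\n  hello world \n\nпривет\n"));
```

With this convention no per-script segmentation is needed, because the user decides where each entry starts and ends. An entry may still contain spaces, as in "hello world" above.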
