How to prioritize regex | (OR) expressions?

Question

I'm trying to match kanji compounds in a Japanese sentence using regex.

Right now, I'm using / ((.)*) / to match a space-delimited compound in, for example, 彼はそこにひと人でいた。

The problem is that in some sentences the word is at the beginning, or is followed by a punctuation character. Ex. いっ瞬の間が生まれた。 or 一昨じつ、彼らはそこを出発した。

I've tried something like / ((.)*) |^((.)*) | ((.)*)、 etc. But this matches 彼はそこにひと人 instead of ひと人 in 彼はそこにひと人でいた。

Is there any way to pack all this in a single regex, or do I have to use one, check whether it returned anything, then try another one if not?

Thanks!

P.S.: I'm using PHP to parse the sentences.

`\b` should certainly work on Unicode. The problem is that PHP is typically **but not always** built with a version of PCRE that has been compiled not to work well with Unicode. Sometimes you can make it better with `//u`, but sometimes you cannot. If you did not personally, explicitly, and manually configure and compile your own dedicated build of the PCRE library **by hand** and then do the same thing all over again with your own special installation of PHP, you cannot rely on its regular expressions working reliably on Unicode. You need a different language if you want reliability. — tchrist, Aug 21 '11 at 14:03

Kamil Szot · Answer 1 · 2011-08-21T22:52:15.180

1

I think this: /([^ 、]+)/ should match the words in the examples you've given. You may want to add other word-terminating characters besides the space and 、 if your texts contain them, or use \pL instead of [^ 、] to cover all Unicode letters.

EXAMPLE

<?php
preg_match_all('/[^ 、]+/u', "彼らは日本の 国民 となった。", $m);
print_r($m);

outputs

Array
(
    [0] => Array
        (
            [0] => 彼らは日本の
            [1] => 国民
            [2] => となった。
        )
)
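Following the suggestion to add more word-terminating characters, here is a minimal sketch that also treats the full stop 。 as a terminator. Adding 。 is an assumption about which punctuation the texts contain; the test sentence is taken from the question.

```php
<?php
// Same idea as the example above, with 。 added to the terminator set.
// Which characters belong in the set is an assumption about the input.
$sentence = '一昨じつ、彼らはそこを出発した。';

// The /u modifier makes PCRE treat pattern and subject as UTF-8, so 、 and 。
// are handled as whole characters rather than as separate bytes.
preg_match_all('/[^ 、。]+/u', $sentence, $m);

print_r($m[0]);
```

This should print two entries, 一昨じつ and 彼らはそこを出発した, with the trailing 。 no longer attached to the last piece.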

edited Aug 21 '11 at 22:52

answered Aug 21 '11 at 12:51

Kamil Szot


When I use `/([^ 、]*)/` on `彼らは日本の国民となった。` it returns `彼らは日本の国民`, not `国民`. `/ ((.)*?) /` returns `国民` for `彼らは日本の国民となった。` (correct), but nothing for `いっ瞬の間が生まれた。` where the word is at the beginning. – Philip Seyfi Aug 21 '11 at 13:03
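On the title question: PCRE tries alternatives left to right, but only at one starting position at a time, and it reports the first position in the subject where any alternative succeeds. Listing the space-delimited alternative first therefore does not give it priority over an alternative that can match earlier in the string. A minimal sketch, assuming the sentence is spaced the way the answers' examples are:

```php
<?php
// Assumed input: the compound is set off by ASCII spaces.
$sentence = '彼はそこに ひと人 でいた。';

// Alternative 1: a space-delimited word. Alternative 2: a word at the start.
preg_match('/ (.*?) |^(.*?) /u', $sentence, $m);

// Alternative 1 cannot match at offset 0 (there is no leading space there),
// but alternative 2 can, so the overall match begins at the sentence start
// and the engine never gets as far as " ひと人 ".
var_dump($m[0]); // whole match: 彼はそこに plus the trailing space
var_dump($m[2]); // group 2: 彼はそこに
```

So the order of the `|` branches only breaks ties between alternatives that could both match at the same position.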

score 1 · Answer 2 · answered Aug 21 '11 at 17:16

Assuming your input is in UTF-8 you could try with

'/(\pL+)/u'

The \pL+ matches one or more letters in the string.

Example:

$str = '彼はそこに ひと人 でいた。';

preg_match_all('/(\pL+)/u', $str, $matches);

var_dump($matches[0]);

Output:

array(3) {
  [0]=>
  string(15) "彼はそこに"
  [1]=>
  string(9) "ひと人"
  [2]=>
  string(9) "でいた"
}
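The lengths in the var_dump output are byte counts, not character counts, which is why a three-character match shows up as string(9). A small sketch of the difference, assuming the script file itself is saved as UTF-8:

```php
<?php
// Assumes this file is saved as UTF-8, where each kana or kanji below
// occupies 3 bytes.
$word = 'ひと人';

// strlen() counts bytes.
$bytes = strlen($word);

// Splitting on the empty pattern with /u yields one element per character.
$chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);

echo $bytes, ' bytes, ', count($chars), " characters\n";
```

This should print "9 bytes, 3 characters".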

score 0 · Answer 3 · answered Aug 21 '11 at 13:07


You're only trying to split your string on some pattern (whitespace or punctuation), is that right? What about this?

In [51]: word = '.test test\n.test'
In [53]: re.split('[\s,.]+',word)
Out[53]: ['', 'test', 'test', 'test']
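Since the question is about PHP, the same splitting idea can be sketched with preg_split. The delimiter set (whitespace, 、 and 。) and the space after いっ瞬 in the test sentence are assumptions for illustration.

```php
<?php
// Assumed input: a space follows the compound at the start of the sentence.
$sentence = 'いっ瞬 の間が生まれた。';

// Split on runs of whitespace or Japanese punctuation. PREG_SPLIT_NO_EMPTY
// drops the empty piece that would otherwise follow the final 。
$parts = preg_split('/[\s、。]+/u', $sentence, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts);
```

This should print いっ瞬 and の間が生まれた. Unlike the Python output above, no empty string appears, because of the PREG_SPLIT_NO_EMPTY flag.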

answered Aug 21 '11 at 13:07

Sam Felix


score 0 · Accepted Answer · answered Aug 22 '11 at 09:44

After thinking about it for a long time, I believe there's no way to parse the compounds without delimiting them all with spaces or some other characters, which is what I'm doing now :)

Ex. if the sentence is 私はノート、ペンなどが必要だ。, there is no way for the computer to know whether 私は (delimited by the sentence start and a space) or ノート (delimited by a space and a comma) is the one it should choose.
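The ambiguity can be made visible by collecting every candidate instead of stopping at the first one. This is only a sketch: the space after 私は is assumed from the description above, and the pattern is one illustrative way to say "starts at the sentence start or after a delimiter, ends before a space or 、".

```php
<?php
// Assumption: a space follows 私は, as the description above implies.
$sentence = '私は ノート、ペンなどが必要だ。';

// A run of non-delimiters that begins at the sentence start or right after
// a delimiter (negative lookbehind), and is followed by a space or 、.
$pattern = '/(?<![^ 、。])[^ 、。]+(?=[ 、])/u';

preg_match_all($pattern, $sentence, $m);

print_r($m[0]);
```

Both 私は and ノート come back as candidates, and nothing in the pattern can say which one is the intended compound.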

Thanks everyone for your suggestions...
