Trying NOT to match a Japanese word using RegEx negative lookbehind

Question

The target structure looks like the following:

検索結果：１００，０００件

If I use the following regex pattern:

((?<!検索結果：)(?<!次の)(((〇|一|二|三|四|五|六|七|八|九|十|百|千|万|億|兆|京+|[0-9０-９]))(,|，|、)?).+((〇|一|二|三|四|五|六|七|八|九|十|百|千|万|億|兆|京|[0-9０-９]).+)件)(?!表示)

As you can see, I want to exclude any number that is preceded by "検索結果：" or "次の", whether the number is written in Arabic numerals or in Japanese kanji (Chinese character) numerals. However, the exclusion only seems to work for numbers of up to 4 digits, not for longer ones (for example, 6 digits).

In other words,

次の１０００件

works (meaning it doesn't match anything), but

次の５，００００件

gives a partial match ("００００件")

I want to know why the exclusion only works for up to 4 digits, and ultimately I want to find a way to make this regex match nothing in these cases. I know this regex is a bit messy. Thanks in advance for your feedback!
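
The behaviour described above can be reproduced with a short script, assuming Python's built-in `re` module. This is only an illustrative sketch; the helper name `first_match` is made up for the example. It suggests why 4 digits appear to work: the lookbehind only rejects the start position directly after 次の, and the engine then retries from the following digits. The rest of the pattern needs at least four characters before 件, so after skipping the first digit of １０００ too few remain, whereas ５，００００ still leaves ００００.

```python
import re

# The pattern from the question, unchanged, split over several lines for readability.
QUESTION_PATTERN = re.compile(
    "((?<!検索結果：)(?<!次の)"
    "(((〇|一|二|三|四|五|六|七|八|九|十|百|千|万|億|兆|京+|[0-9０-９]))(,|，|、)?)"
    ".+"
    "((〇|一|二|三|四|五|六|七|八|九|十|百|千|万|億|兆|京|[0-9０-９]).+)"
    "件)(?!表示)"
)


def first_match(text):
    """Return the first matched substring, or None (hypothetical helper)."""
    m = QUESTION_PATTERN.search(text)
    return m.group(0) if m else None


# Only three digits remain after the blocked start position, so nothing matches.
print(first_match("次の１０００件"))      # None

# Four digits remain after "５，", which is enough for a match to start mid-number.
print(first_match("次の５，００００件"))  # ００００件
```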

Are you looking for [`\p{N}+`](https://regex101.com/r/34kgDy/1/) ? Or the opposite, [`\P{N}+`](https://regex101.com/r/34kgDy/2) ? — Jan, Jan 15 '19 at 07:25
I see this is related to Jan's response: https://stackoverflow.com/questions/14891129/regular-expression-pl-and-pn — Michael, Jan 15 '19 at 07:31
When you talk about regex, you always must state which language/regex engine you are using. — Tomalak, Jan 15 '19 at 07:41
Sorry, it's in a Python script. Wiktor, I think your work does the job! I'll test some more and report back. Thanks in advance! — Michael, Jan 15 '19 at 09:28
Are you sure you want the `.+` terms, which mean "match 1 or more of anything"? — Bohemian, Jan 15 '19 at 23:51
@WiktorStribiżew, I checked the regex but it didn't do well with other patterns. Here's the complete list of words that should and should not match. https://regex101.com/r/f1SybY/2 — Michael, Jan 16 '19 at 02:54
I see, `[０-９]` is not forming a word char. Use https://regex101.com/r/f1SybY/4. Or [a bit shorter](https://regex101.com/r/f1SybY/5). Or, for PCRE, [even shorter](https://regex101.com/r/f1SybY/6). — Wiktor Stribiżew, Jan 16 '19 at 08:13

score 2 · Accepted Answer · answered Jan 16 '19 at 10:02

You need to prevent a match from starting right after a digit, or right after a digit plus a separator, so add (?<![０-９0-9])(?<![０-９0-9][，,、]) right after (?<!次の):

(?<!検索結果：)(?<!次の)(?<![０-９0-9])(?<![０-９0-9][，,、])(?:[〇一二三四五六七八九十百千万億兆0-9０-９]|京+)[,，、]?.+[〇一二三四五六七八九十百千万億兆京0-9０-９].+件
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

See the regex demo.
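
As a quick check, the pattern above can be run against the strings from the question. This is a minimal sketch assuming Python's `re` module; the name `ANSWER_PATTERN` and the sample list are illustrative only, and the last sample (taken from the other answer's test string) is included to show that an unprefixed number is still found.

```python
import re

# The pattern from this answer, split over several lines for readability.
ANSWER_PATTERN = re.compile(
    "(?<!検索結果：)(?<!次の)"
    "(?<![０-９0-9])(?<![０-９0-9][，,、])"  # the two added lookbehinds
    "(?:[〇一二三四五六七八九十百千万億兆0-9０-９]|京+)[,，、]?"
    ".+[〇一二三四五六七八九十百千万億兆京0-9０-９].+件"
)

samples = [
    "検索結果：１００，０００件",  # should not match
    "次の１０００件",              # should not match
    "次の５，００００件",          # should not match
    "販売実績１００万件",          # should match the number part
]

for text in samples:
    m = ANSWER_PATTERN.search(text)
    print(text, "->", m.group(0) if m else None)
```

With the extra lookbehinds a match can no longer begin in the middle of a number, so the first three samples give no match and the last one gives １００万件.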

score 0 · Answer 2 · answered Feb 07 '19 at 03:00

Here's one problem that I see so far:

販売実績100万件販売実績１００万件販売実績1,000件販売実績１，０００件販売実績1,000,000件です１００，０００件５０００件

These all match, but the match also captures the irrelevant text in between two matching patterns. For instance,

販売実績100万件販売実績１００万件

as one string, will also match the part in between that's not supposed to match.

https://regex101.com/r/LfDPHE/1
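
The over-matching comes from the greedy `.+` terms, which can run across the text between two numbers. The sketch below, assuming Python's `re` module, first reproduces the problem with the accepted answer's pattern and then tries one possible tightening. `TIGHT_PATTERN`, `NUM` and `SEP` are hypothetical names, and the tightened pattern is an assumption of mine rather than something proposed in this thread: it replaces `.+` with runs of number characters and separators. Note that it also accepts short numbers such as 5件, which the original pattern does not, and it omits the `(?!表示)` lookahead from the question.

```python
import re

NUM = "[〇一二三四五六七八九十百千万億兆京0-9０-９]"  # number characters (assumed set)
SEP = "[,，、]"                                      # thousands separators

# Accepted answer's pattern, for comparison.
ANSWER_PATTERN = re.compile(
    "(?<!検索結果：)(?<!次の)"
    "(?<![０-９0-9])(?<![０-９0-9][，,、])"
    "(?:[〇一二三四五六七八九十百千万億兆0-9０-９]|京+)[,，、]?"
    ".+[〇一二三四五六七八九十百千万億兆京0-9０-９].+件"
)

# Hypothetical tightened variant: only number characters and separators
# are allowed between the first number character and 件.
TIGHT_PATTERN = re.compile(
    "(?<!検索結果：)(?<!次の)"
    f"(?<!{NUM})(?<!{NUM}{SEP})"
    f"{NUM}+(?:{SEP}{NUM}+)*件"
)

text = "販売実績100万件販売実績１００万件"

# The greedy `.+` swallows the text between the two numbers.
print(ANSWER_PATTERN.findall(text))  # ['100万件販売実績１００万件']

# The tightened variant returns the two numbers separately.
print(TIGHT_PATTERN.findall(text))   # ['100万件', '１００万件']
```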
