Regular Expression: I have a problem with using '|'

Question

text = "a/NNP b/NNG c/NP d/NNP e/PNG"

I want to take out words with only 'NNP' and 'NNG' tags.

So I tried:

words = re.compile('(\w+/[(NNP)|(NNG)]*)')
t = re.findall(words,text)

However, the result keeps showing me

['a/NNP', 'b/NNG', 'c/NP', 'd/NNP','e/PNG'].
How can I get only ['a/NNP','b/NNG','d/NNP']?

Well that case was just an example. I have several tags such as [NNP|NNG|VV|VA|MAG|MAJ|IC|VX|MM], but some tags that are not in the list (for example, NP or VCP or VAX) keep appearing. Can I just get the tag in order? — S.joo, Jan 16 '19 at 07:52
Then use `r'\w+/(?:NNP|NNG|all|other|alternatives|here)\b'`. See my answer. — Wiktor Stribiżew, Jan 16 '19 at 07:56
I use https://rubular.com/ to test the behaviour of regexes to see immediate results. (Although it is "for ruby", it should do the trick here) — Thomas Junk, Jan 16 '19 at 07:58

score 5 · Answer 1 · answered Jan 16 '19 at 07:51

You may use

import re

text = "a/NNP b/NNG c/NP d/NNP e/PNG" 
words = re.compile(r'\w+/(?:NNP|NNG)\b')
# OR words = re.compile(r'\w+/NN[PG]\b')
print(re.findall(words,text)) 
# => ['a/NNP', 'b/NNG', 'd/NNP']

The regex is \w+/(?:NNP|NNG)\b, or the shorter equivalent \w+/NN[PG]\b. It matches

\w+ - 1+ word chars (NOTE: to only match letters, replace \w+ with [^\W\d_]+)
/ - a / char
(?:NNP|NNG) - a non-capturing group matching either NNP or NNG
NN[PG] - in the shorter variant, the NN substring followed by either P or G
\b - a word boundary (in order not to match /NNGGGG or whatever).
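
To see what the word boundary buys you, here is a small sketch that runs both variants with and without \b. The sample string (including the NNGX tag) is made up for illustration and is not from the question.

```python
import re

# Hypothetical sample: NNGX is an invented tag that merely starts with NNG.
sample = "a/NNG b/NNGX c/NNP d/NP"

# Both spellings of the pattern should behave identically.
with_group = re.findall(r'\w+/(?:NNP|NNG)\b', sample)
with_class = re.findall(r'\w+/NN[PG]\b', sample)

# Without \b the regex stops after NNG and reports a partial match for b/NNGX.
without_boundary = re.findall(r'\w+/NN[PG]', sample)

print(with_group)        # ['a/NNG', 'c/NNP']
print(with_class)        # ['a/NNG', 'c/NNP']
print(without_boundary)  # ['a/NNG', 'b/NNG', 'c/NNP']
```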

Wiktor Stribiżew

Thank you sooo much for your help. I never knew what non-capturing group is before. – S.joo Jan 16 '19 at 07:58
@S.joo You may read a lot about [non-capturing groups here](https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-what-does-do). – Wiktor Stribiżew Jan 16 '19 at 07:59

score 2 · Accepted Answer · answered Jan 16 '19 at 07:54

[] denotes a character class. It is not used to group together stuff, like it is used in maths.
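
A short sketch of why the pattern from the question matches everything: inside [...] the parentheses and the | are ordinary characters, so the class is just a set of six characters. The second pattern below is my own rewrite of that set, added only to show the two behave the same.

```python
import re

text = "a/NNP b/NNG c/NP d/NNP e/PNG"

# The class from the question: any of ( ) | N P G, repeated zero or more times.
char_class = re.compile(r'\w+/[(NNP)|(NNG)]*')
# The same set of characters written without the misleading grouping syntax.
same_thing = re.compile(r'\w+/[NPG()|]*')

broken = char_class.findall(text)
equivalent = same_thing.findall(text)

print(broken)      # ['a/NNP', 'b/NNG', 'c/NP', 'd/NNP', 'e/PNG']
print(equivalent)  # ['a/NNP', 'b/NNG', 'c/NP', 'd/NNP', 'e/PNG']
```

NP and PNG get through because they are built only from the letters N, P and G.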

You can use a non-capturing group (?:) in place of []:

\w+/(?:NNP|NNG)\b

If your tags are always exactly three characters long, then there is no need for \b.

You can add as many options as you want:

\w+/(?:NNP|NNG|ABC|DEF|GHI)\b
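
With a longer tag list it may be easier to build the alternation from a Python list than to type it by hand. This is a minimal sketch: build_tag_pattern is a hypothetical helper name, the tag list is the one from the asker's comment, and the sample string is made up.

```python
import re

def build_tag_pattern(tags):
    # re.escape guards against tags that contain regex metacharacters.
    alternation = '|'.join(re.escape(tag) for tag in tags)
    # Non-capturing group plus a word boundary, as in the patterns above.
    return re.compile(r'\w+/(?:' + alternation + r')\b')

tags = ['NNP', 'NNG', 'VV', 'VA', 'MAG', 'MAJ', 'IC', 'VX', 'MM']
pattern = build_tag_pattern(tags)

# Hypothetical sample containing the unwanted tags VAX, NP and VCP.
sample = "a/NNP b/VA c/VAX d/NP e/VCP f/MM"
matches = pattern.findall(sample)
print(matches)  # ['a/NNP', 'b/VA', 'f/MM']
```

Here the \b matters: VA is in the list and VAX is not, so without the boundary c/VAX would be reported as c/VA.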

Jerry · Answer 3 · 2019-01-16T09:36:52.937

1

I wouldn't say you need regex for that?

stuff = ('NNP', 'NNG')
text = "a/NNP b/NNG c/NP d/NNP e/PNG"
result = [i for i in text.split() if i.split("/")[1] in stuff]
# ['a/NNP', 'b/NNG', 'd/NNP']

The above was also faster than the regex counterpart in a quick timeit run, and is easier to maintain:

>>> import re
>>>
>>> text = "a/NNP b/NNG c/NP d/NNP e/PNG"
>>> stuff = ('NNP', 'NNG', 'VV', 'VA', 'MAG', 'MAJ', 'IC', 'VX', 'MM')
>>>
>>> def regex(reg):
...     words = re.compile(reg)
...     return re.findall(words,text)
...
>>> def notregex():
...     return [i for i in text.split() if i.split("/")[1] in stuff]
...
>>> from timeit import timeit
>>> timeit(stmt="regex(a)", setup=r"from __main__ import regex; a=r'\w+/(?:NNP|NNG|VV|VA|MAG|MAJ|IC|VX|MM)\b'", number=100000)
0.3145495569999639
>>> timeit(stmt="notregex()", setup="from __main__ import notregex", number=100000)
0.21294589500007532
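
Picking up the set and generator suggestions from the comments, a variant of the split approach could look like the sketch below. filter_tagged is a hypothetical name, and using rpartition is my own assumption: it skips tokens without a slash instead of raising IndexError, and it treats the text after the last slash as the tag.

```python
def filter_tagged(text, wanted):
    """Yield the word/TAG tokens of text whose tag is in wanted."""
    wanted = frozenset(wanted)  # constant-time membership test
    for token in text.split():
        # rpartition returns an empty separator when the token has no '/'.
        word, sep, tag = token.rpartition('/')
        if sep and tag in wanted:
            yield token

text = "a/NNP b/NNG c/NP d/NNP e/PNG"
result = list(filter_tagged(text, {'NNP', 'NNG'}))
print(result)  # ['a/NNP', 'b/NNG', 'd/NNP']
```

The input order and any duplicate tokens are preserved; only the lookup table of tags is a set.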

edited Jan 16 '19 at 09:36

answered Jan 16 '19 at 07:55

Jerry

Is there any reason you are using a list and not a _set_ for this? – Thomas Junk Jan 16 '19 at 07:56
@ThomasJunk No specific reason other than I tend to default to lists more than sets – Jerry Jan 16 '19 at 07:57
@ThomasJunk: If you agree that a set implies additional overhead, where in the question do you see the justification for it? I would actually be annoyed by a filter removing duplicates without being asked to. A more useful alternative would be a generator, to reduce the memory footprint for a high number of results. – guidot Jan 16 '19 at 08:13
@guidot I think Thomas' suggestion was fair, that's why I changed the list to a set. In the question's context, strings following a specific pattern are being extracted, it means having two identical patterns would not add any additional benefit; and set lookup is faster than list lookup, so... – Jerry Jan 16 '19 at 08:44
@guidot I tend to disagree. The semantics of this is _having a bunch of unique criteria_, which is semantically _a set_. I admit, this is a _nitpick_. – Thomas Junk Jan 16 '19 at 08:51
