text = "a/NNP b/NNG c/NP d/NNP e/PNG" 

I want to extract only the words tagged 'NNP' or 'NNG'.

So I tried:

words = re.compile('(\w+/[(NNP)|(NNG)]*)')
t = re.findall(words,text)

However, the result keeps showing me

['a/NNP', 'b/NNG', 'c/NP', 'd/NNP', 'e/PNG']

How can I get only ['a/NNP', 'b/NNG', 'd/NNP']?
Rakesh
S.joo
  • Use `re.compile(r'\w+/NN[PG]\b')` – Wiktor Stribiżew Jan 16 '19 at 07:49
  • Well, that case was just an example. I have several tags, such as [NNP|NNG|VV|VA|MAG|MAJ|IC|VX|MM], but some tags that are not in the list (for example, NP, VCP, or VAX) keep appearing. Can I get just the tags in the list? – S.joo Jan 16 '19 at 07:52
  • Then use `r'\w+/(?:NNP|NNG|all|other|alternatives|here)\b'`. See my answer. – Wiktor Stribiżew Jan 16 '19 at 07:56
  • I use https://rubular.com/ to test the behaviour of regexes to see immediate results. (Although it is "for ruby", it should do the trick here) – Thomas Junk Jan 16 '19 at 07:58

3 Answers


You may use

import re

text = "a/NNP b/NNG c/NP d/NNP e/PNG" 
words = re.compile(r'\w+/(?:NNP|NNG)\b')
# OR words = re.compile(r'\w+/NN[PG]\b')
print(re.findall(words,text)) 
# => ['a/NNP', 'b/NNG', 'd/NNP']

See Python demo.

The regex is \w+/(?:NNP|NNG)\b or, equivalently, \w+/NN[PG]\b. It matches:

  • \w+ - 1+ word chars (NOTE: to only match letters, replace \w+ with [^\W\d_]+)
  • / - a literal slash
  • (?:NNP|NNG) - a non-capturing group matching either NNP or NNG (first variant)
  • NN[PG] - the NN substring followed by either P or G (second variant)
  • \b - a word boundary (in order not to match /NNGGGG or whatever).
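To make the role of \b and of the letters-only variant concrete, here is a small sketch. The sample string is made up for illustration: f/NNGX carries a tag that merely starts with NNG, and h1/NNP has a digit in the word part.

```python
import re

# Hypothetical sample, not taken from the question.
sample = "a/NNP b/NNG f/NNGX g/NNP. h1/NNP"

# With \b, a tag that continues with more word characters is rejected.
with_boundary = re.findall(r'\w+/(?:NNP|NNG)\b', sample)
# Without \b, the prefix of the longer tag NNGX is matched as NNG.
without_boundary = re.findall(r'\w+/(?:NNP|NNG)', sample)
# [^\W\d_]+ accepts letters only, so 'h1/NNP' is skipped entirely.
letters_only = re.findall(r'[^\W\d_]+/(?:NNP|NNG)\b', sample)

print(with_boundary)     # ['a/NNP', 'b/NNG', 'g/NNP', 'h1/NNP']
print(without_boundary)  # ['a/NNP', 'b/NNG', 'f/NNG', 'g/NNP', 'h1/NNP']
print(letters_only)      # ['a/NNP', 'b/NNG', 'g/NNP']
```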
Wiktor Stribiżew
  • Thank you sooo much for your help. I never knew what non-capturing group is before. – S.joo Jan 16 '19 at 07:58
  • @S.joo You may read a lot about [non-capturing groups here](https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-what-does-do). – Wiktor Stribiżew Jan 16 '19 at 07:59

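[] denotes a character class: it matches a single character from the set listed inside it. It is not used to group things together the way brackets are in maths. That is why [(NNP)|(NNG)]* matches any run of the characters (, ), |, N, P and G, including NP and PNG.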
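A quick way to see this is to run the pattern from the question next to a character class that lists the same characters only once. This is just an illustrative sketch; the variable names are arbitrary.

```python
import re

text = "a/NNP b/NNG c/NP d/NNP e/PNG"

# Inside [...], the characters ( ) | N P G are plain members of a set,
# so this matches a slash followed by any run of them, in any order.
char_class = re.findall(r'\w+/[(NNP)|(NNG)]*', text)

# The same set written without the repeated characters.
equivalent = re.findall(r'\w+/[NPG()|]*', text)

print(char_class)  # ['a/NNP', 'b/NNG', 'c/NP', 'd/NNP', 'e/PNG']
print(char_class == equivalent)  # True
```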

You can use a non-capturing group (?:...) in place of []:

\w+/(?:NNP|NNG)\b

If every tag is always exactly three characters long, then there is no need for \b.

You can add as many options as you want:

\w+/(?:NNP|NNG|ABC|DEF|GHI)\b
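When the list of tags is long, the alternation can be built from a Python list instead of being typed by hand. The following is a minimal sketch: the tags are the ones mentioned in the comments on the question, while the sample string and the variable names are assumptions made for illustration.

```python
import re

# Tags taken from the asker's comment.
tags = ["NNP", "NNG", "VV", "VA", "MAG", "MAJ", "IC", "VX", "MM"]

# re.escape guards against tags that contain regex metacharacters.
alternation = "|".join(map(re.escape, tags))
pattern = re.compile(r'\w+/(?:' + alternation + r')\b')

# Hypothetical sample containing some unwanted tags (NP, VCP, VAX).
sample = "a/NNP b/VV c/NP d/VCP e/VAX f/MAG"
print(pattern.findall(sample))  # ['a/NNP', 'b/VV', 'f/MAG']
```

Note that e/VAX is rejected only because of the trailing \b: VA matches, but a word character follows it.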
Sweeper

I wouldn't say you need regex for that.

stuff = ('NNP', 'NNG')
text = "a/NNP b/NNG c/NP d/NNP e/PNG"
result = [i for i in text.split() if i.split("/")[1] in stuff]
# ['a/NNP', 'b/NNG', 'd/NNP']

In the timing below, the above is also faster than the regex counterpart, and it is arguably easier to maintain:

>>> import re
>>>
>>> text = "a/NNP b/NNG c/NP d/NNP e/PNG"
>>> stuff = ('NNP', 'NNG', 'VV', 'VA', 'MAG', 'MAJ', 'IC', 'VX', 'MM')
>>>
>>> def regex(reg):
...     words = re.compile(reg)
...     return re.findall(words,text)
...
>>> def notregex():
...     return [i for i in text.split() if i.split("/")[1] in stuff]
...
>>> from timeit import timeit
>>> timeit(stmt="regex(a)", setup="from __main__ import regex; a=r'\w+/(?:NNP|NNG|VV|VA|MAG|MAJ|IC|VX|MM)\b'", number=100000)
0.3145495569999639
>>> timeit(stmt="notregex()", setup="from __main__ import notregex", number=100000)
0.21294589500007532
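One thing to watch when reproducing this timing: the setup argument is an ordinary (non-raw) string, so the \b inside it is turned into a backspace character before the inner r'...' literal is ever evaluated, and the timed pattern then finds no matches at all. The sketch below shows the effect and a variant that avoids it by precompiling the pattern and passing callables to timeit. The function names are my own, and no timings are quoted since they depend on the machine.

```python
import re
from timeit import timeit

text = "a/NNP b/NNG c/NP d/NNP e/PNG"
stuff = ('NNP', 'NNG', 'VV', 'VA', 'MAG', 'MAJ', 'IC', 'VX', 'MM')

# In a non-raw string, "\b" is a backspace, not a word boundary.
not_raw = "\\w+/(?:NNP|NNG)\b"
raw = r"\w+/(?:NNP|NNG)\b"
print(re.findall(not_raw, text))  # []
print(re.findall(raw, text))      # ['a/NNP', 'b/NNG', 'd/NNP']

# Precompile once so that only the matching itself is timed.
PATTERN = re.compile(r'\w+/(?:NNP|NNG|VV|VA|MAG|MAJ|IC|VX|MM)\b')

def with_regex():
    return PATTERN.findall(text)

def without_regex():
    return [i for i in text.split() if i.split("/")[1] in stuff]

# timeit accepts callables directly, which sidesteps string escaping.
t_regex = timeit(with_regex, number=100000)
t_split = timeit(without_regex, number=100000)
print(f"regex: {t_regex:.3f}s, split: {t_split:.3f}s")
```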
Jerry
  • Is there any reason you are using a list and not a _set_ for this? – Thomas Junk Jan 16 '19 at 07:56
  • @ThomasJunk No specific reason other than I tend to default to lists more than sets – Jerry Jan 16 '19 at 07:57
  • @ThomasJunk: If you agree that a set implies additional overhead, where in the question do you see the justification for it? I would actually be annoyed by a filter that removes duplicates without being asked to. A more useful alternative would be a generator, to reduce the memory footprint for a high number of results. – guidot Jan 16 '19 at 08:13
  • @guidot I think Thomas' suggestion was fair, that's why I changed the list to a set. In the question's context, strings following a specific pattern are being extracted, it means having two identical patterns would not add any additional benefit; and set lookup is faster than list lookup, so... – Jerry Jan 16 '19 at 08:44
  • @guidot I tend to disagree. The semantics of this is _having a bunch of unique criteria_, which is semantically _a set_. I admit this is a _nitpick_. – Thomas Junk Jan 16 '19 at 08:51
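The two suggestions in these comments can be combined: keep the wanted tags in a set (the criteria are unique) and yield the matches lazily from a generator. This is only a sketch. The function name tagged is hypothetical, and unlike the answer's split("/")[1] it uses rpartition, which takes the text after the last slash as the tag and skips tokens that have no slash.

```python
def tagged(text, wanted):
    """Lazily yield the word/TAG tokens whose tag is in `wanted`."""
    for token in text.split():
        # rpartition returns ('', '', token) when there is no slash.
        _word, sep, tag = token.rpartition("/")
        if sep and tag in wanted:
            yield token

wanted = frozenset({"NNP", "NNG"})
text = "a/NNP b/NNG c/NP d/NNP e/PNG"
print(list(tagged(text, wanted)))  # ['a/NNP', 'b/NNG', 'd/NNP']
```

The set holds only the criteria, so repeated tokens in the input are still all returned.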