How do I parse a sentence using regex?

Question

I need to parse a sentence like "Alice is a boy." into ['Alice', 'boy'] and "An elephant is a mammal." into ['elephant', 'mammal']. That is, I need to split the string on 'is' while also removing 'a'/'an'. Is there an elegant way to do it?

Sounds like you need to [remove stopwords](http://stackoverflow.com/questions/5486337/how-to-remove-stop-words-using-nltk-or-python) and get the rest by simple splitting. — Wiktor Stribiżew, Apr 30 '17 at 22:06

score 0 · Answer 1 · answered Apr 30 '17 at 22:09


This answer does not make use of regex, but it is one way of doing it:

s = 'Alice is a boy'
s = s.split() # each word becomes an entry in a list
s = [word for word in s if word != 'a' and word != 'an' and word != 'is']

The main downside to this is that you would need to list out every word you want to exclude in the list comprehension.


Deem


What about the example `An elephant is a mammal.`? Also, you forgot the full-stop. – Peter Wood May 01 '17 at 06:15
Simpler is `word for word in s if not word in {'a', 'an', 'is'}` – Peter Wood May 01 '17 at 06:18
That's true, this method does not account for the full stop, oops. One could handle that with the string method `str.translate`. – Deem May 01 '17 at 19:21
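
The points raised in these comments (the trailing full stop, the capitalised "An", and the set-based membership test) can be combined into one sketch. The function name `content_words` and the `STOPWORDS` set are made up for illustration and are not part of the original answer; the sketch also assumes that only ASCII punctuation needs to be stripped.

```python
import string

# Assumed stop-word set for the two example sentences; extend as needed.
STOPWORDS = {'a', 'an', 'is'}

def content_words(sentence):
    # Delete every ASCII punctuation character (this removes the full stop).
    cleaned = sentence.translate(str.maketrans('', '', string.punctuation))
    # Compare in lower case so that a sentence-initial "An" is dropped too.
    return [word for word in cleaned.split() if word.lower() not in STOPWORDS]

print(content_words('Alice is a boy.'))           # ['Alice', 'boy']
print(content_words('An elephant is a mammal.'))  # ['elephant', 'mammal']
```

The downside noted in the answer still applies: every word to exclude has to be listed in `STOPWORDS`.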

score 0 · Accepted Answer · answered Apr 30 '17 at 22:57

If you insist on using a regex, you can do it with re.search:

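import re

# (?:an? )? optionally matches the article "a " or "an ";
# a character class such as [a|an] would match only a single character
pattern = r'(\w+) is (?:an? )?(\w+)'

print(re.search(pattern, "Alice is a boy.").groups())
# output: ('Alice', 'boy')

print(re.search(pattern, "An elephant is a mammal.").groups())
# output: ('elephant', 'mammal')
# apply list() if you want it as a list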
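
One caveat with calling `.groups()` directly: `re.search` returns `None` when the sentence does not match, so the call raises `AttributeError`. A minimal sketch of a guarded helper follows. The name `parse_sentence` is hypothetical, the pattern assumes sentences of the form "X is [a/an] Y", and the article is matched with an optional non-capturing group, `(?:an? )?`.

```python
import re

# Hypothetical helper, not part of the original answer.
# (?:an? )? optionally matches "a " or "an " without capturing it.
PATTERN = re.compile(r'(\w+) is (?:an? )?(\w+)', re.IGNORECASE)

def parse_sentence(sentence):
    match = PATTERN.search(sentence)
    # Guard against None before asking for the captured groups.
    return list(match.groups()) if match else None

print(parse_sentence("Alice is a boy."))           # ['Alice', 'boy']
print(parse_sentence("An elephant is a mammal."))  # ['elephant', 'mammal']
print(parse_sentence("Hello world"))               # None
```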
