I want to extract the run of words in a string that starts at a token tagged /IN and continues up to a token tagged /NNP. My code so far (it does not work yet):

import re

sentence = "Entah/RB kenapa/NN ini/DT bayik/NN suka/VBI banget/JJ :/: )/CP :/: )/CP :/: )/CP berenang/VBI di/IN Jln/NN Terusan/NNP Borobudur/NNP dan/NN di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP"

tes = re.findall(r'((?:\S+/IN\s\w+/NNP\s*)+)', sentence)
print(tes)

The sentence contains the word sequences di/IN Jln/NN Terusan/NNP Borobudur/NNP and di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP, which I would like to extract. The expected result:

['di/IN Jln/NN Terusan/NNP Borobudur/NNP', 'di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP']

What is the best way to do this? Thanks.

Chanda Korat
ytomo

1 Answer

Use

r'\S+/IN\b(?:(?!\S+/IN\b).)+\S+/NNP\b'

See the regex demo

Details

  • \S+ - 1+ non-whitespace symbols
  • /IN\b - a /IN substring as a whole word
  • (?:(?!\S+/IN\b).)+ - 1+ chars other than line break chars, each consumed only if the \S+/IN\b pattern does not match at that position (use re.DOTALL to match line breaks, too)
  • \S+/NNP\b - 1+ non-whitespaces, /NNP as a whole word (since \b is a word boundary)
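
To make the pattern above concrete, here is a minimal sketch that applies it to the sentence from the question with `re.findall`. The variable names `pattern` and `matches` are only illustrative. Because the pattern has no capturing groups, `re.findall` returns the full matches.

```python
import re

sentence = "Entah/RB kenapa/NN ini/DT bayik/NN suka/VBI banget/JJ :/: )/CP :/: )/CP :/: )/CP berenang/VBI di/IN Jln/NN Terusan/NNP Borobudur/NNP dan/NN di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP"

# Start at a token tagged /IN, consume characters as long as no new
# "<word>/IN" token begins, and end at the last token tagged /NNP
# that can still be reached.
pattern = r'\S+/IN\b(?:(?!\S+/IN\b).)+\S+/NNP\b'

matches = re.findall(pattern, sentence)
print(matches)
```

For this input the two matches should be the two spans listed as the expected result in the question.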
Wiktor Stribiżew
  • Try : `sentence = "Entah/RB kenapa/NN ini/DT bayik/NN suka/VBI banget/JJ :/: )/CP :/: )/CP :/: )/CP berenang/VBI di/IN Jln/NN Terusan/NNP Borobudur d/NN /NNP dan/NN di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP"` added `d/NN` between the first two `NNP`. – Mazdak Apr 13 '17 at 06:28
  • Still doesn't work. Besides that case, it also doesn't cover the case where there is no string between `/IN` and `/NNP`. – Mazdak Apr 13 '17 at 06:33
  • @Kasramvd: There must be something between them, there will be at least a space, thus the `+` quantifier is correct. – Wiktor Stribiżew Apr 13 '17 at 06:34
  • I mean with space. It doesn't match that case. Test with `sentence = "Entah/RB kenapa/NN ini/DT bayik/NN suka/VBI banget/JJ :/: )/CP :/: )/CP :/: )/CP berenang/VBI di/IN Jln/NN Terusan/NNP Borobudur d/IN /NNP dan/NN di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP" ` – Mazdak Apr 13 '17 at 06:36
  • Can a token be empty and classified as a noun? That is weird, but [`[^\s/]*/IN\b(?:(?![^\s/]*/IN\b).)+[^\s/]*/NNP\b`](https://regex101.com/r/S7PyAs/6) would cover that. – Wiktor Stribiżew Apr 13 '17 at 06:36
  • Also note that using `(?:(?!\S+/IN\b).)+` to refuse matching a string that contains a particular substring is quite inefficient on its own. Now think what a monster it would be inside a repeated capture group :). – Mazdak Apr 13 '17 at 06:39
  • Thanks @Kasramvd for the explanation. @wiktor-stribiżew in my case there will never be a bare `/NNP`; there must always be a word in front of it. Thanks, by the way, for the reference. – ytomo Apr 13 '17 at 06:40
  • Yes that will cover it. But now you have [Two problems](http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html). – Mazdak Apr 13 '17 at 06:41
  • Cool, use [`[^\s/]*/IN\b[^/]*(?:/(?!IN\b)[^/]*)*/NNP\b`](https://regex101.com/r/S7PyAs/7) if you want to speed it up. – Wiktor Stribiżew Apr 13 '17 at 06:42
  • @ytomo I see, this regex is clever and will cover almost all the cases (+), but note that if you are dealing with larger data sets and need more performance, it'd be a better idea to use a parser or some other built-in functions. – Mazdak Apr 13 '17 at 06:44
  • @Kasramvd: You are free to post a non-regex solution. Why did you choose the regex in your deleted answer if you are so against a regex here? – Wiktor Stribiżew Apr 13 '17 at 06:44
  • @WiktorStribiżew I'm not against regex, I'm against complicated regex. And if you were a Pythonista you'd know that `Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated.` That's why I deleted that answer: right after I recognized the real problem, I preferred not to expand it. – Mazdak Apr 13 '17 at 06:48
  • I know that, I read it. But "beautiful" does not go hand in hand with regex if you need efficiency. – Wiktor Stribiżew Apr 13 '17 at 06:53
  • @ytomo: There is only one "complex" pattern in my solution: the `(?:(?!neg_lookahead).)+` [tempered greedy token](http://stackoverflow.com/questions/30900794/tempered-greedy-token-what-is-different-about-placing-the-dot-before-the-negat). When you unroll it, you will get [a more efficient, but uglier-looking regex](http://stackoverflow.com/questions/43384820/how-to-extract-some-words-pattern-from-string-using-regex-in-python/43384944?noredirect=1#comment73831971_43384944). – Wiktor Stribiżew Apr 13 '17 at 07:07
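
For anyone comparing the options raised in this thread, the sketch below runs the unrolled pattern quoted in the comments next to a plain token scan, as one possible reading of the "non-regex" suggestion. The helper `extract_with_tokens` is hypothetical and not from any answer here. It assumes tokens are separated by whitespace, and that a span starts at a token ending in `/IN` and ends at the last token ending in `/NNP` before the next `/IN` token.

```python
import re

sentence = "Entah/RB kenapa/NN ini/DT bayik/NN suka/VBI banget/JJ :/: )/CP :/: )/CP :/: )/CP berenang/VBI di/IN Jln/NN Terusan/NNP Borobudur/NNP dan/NN di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP"

# Unrolled variant quoted in the comments above.
UNROLLED = re.compile(r'[^\s/]*/IN\b[^/]*(?:/(?!IN\b)[^/]*)*/NNP\b')


def extract_with_tokens(text):
    """Hypothetical non-regex helper (an assumption, not from the thread).

    Collects spans that start at a token ending in /IN and end at the
    last token ending in /NNP seen before the next /IN token.
    """
    tokens = text.split()
    spans = []
    start = None     # index of the current /IN token, if any
    last_nnp = None  # index of the last /NNP token since `start`
    for i, tok in enumerate(tokens):
        if tok.endswith("/IN"):
            # A new /IN token closes the previous span, if it had an /NNP.
            if start is not None and last_nnp is not None:
                spans.append(" ".join(tokens[start:last_nnp + 1]))
            start, last_nnp = i, None
        elif tok.endswith("/NNP") and start is not None:
            last_nnp = i
    # Flush the span that is still open at the end of the text.
    if start is not None and last_nnp is not None:
        spans.append(" ".join(tokens[start:last_nnp + 1]))
    return spans


regex_spans = UNROLLED.findall(sentence)
token_spans = extract_with_tokens(sentence)
print(regex_spans)
print(token_spans)
```

On the question's sentence both approaches should give the same two spans. They can differ on edge cases such as a bare `/NNP` token or runs of several spaces, so treat the token scan as a starting point rather than a drop-in replacement.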