Python Regex Findall Not Working As Expected

Question

I'm trying to capture the output of the Stanford CoreNLP dependency parser using a regex. I want to capture the dependency parse, which spans several lines (everything between `dependencies):` and `Sentence`). A sample of the data:

Dependency Parse (enhanced plus plus dependencies):
root(ROOT-0, imply-5)
dobj(imply-5, what-1)
aux(imply-5, does-2)
det(man-4, the-3)
nsubj(imply-5, man-4)
advmod(mentions-8, when-6)
nsubj(mentions-8, he-7)
advcl(imply-5, mentions-8)
det(papers-10, the-9)
dobj(mentions-8, papers-10)
nsubj(written-13, he-11)
aux(written-13, has-12)
acl:relcl(papers-10, written-13)

Sentence #1 (10 tokens):

The code I'm using is:

regex = re.compile('dependencies\):(.*)Sentence', re.DOTALL)
found = regex.findall(text)

When I run it, the code matches the whole text document rather than just the capture group. It works fine when I try it out on Regexr.

Help much appreciated

It seems to work for me: https://gist.github.com/Aankhen/043395db71552e98d80b19aedbeec4d2 — Aankhen, Jul 05 '18 at 10:21
If Rakesh's answer works for you, the only problem is that your real document contains multiple `Sentence`s after `dependencies):` and you need to get the leftmost `Sentence`s only, thus lazy matching is what you need, `.*?`. — Wiktor Stribiżew, Jul 05 '18 at 10:24
Yup, I should have made that more explicit in the question. I didn't realise that it was crucial in this instance :) — ggordon, Jul 05 '18 at 10:25
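
To illustrate the greedy versus lazy point from the comments above, here is a minimal sketch. The two-sentence `text` is a made-up sample (an assumption about what the real CoreNLP output looks like once it contains more than one `Sentence`), not data from the question.

```python
import re

# Hypothetical sample: two dependency blocks, each followed by a "Sentence" header.
text = (
    "Dependency Parse (enhanced plus plus dependencies):\n"
    "root(ROOT-0, imply-5)\n"
    "\n"
    "Sentence #2 (3 tokens):\n"
    "Dependency Parse (enhanced plus plus dependencies):\n"
    "root(ROOT-0, run-2)\n"
    "\n"
    "Sentence #3 (4 tokens):\n"
)

# Greedy: .* runs to the end of the text and backtracks to the LAST "Sentence",
# so the single match swallows everything in between.
greedy = re.findall(r"dependencies\):(.*)Sentence", text, flags=re.DOTALL)

# Lazy: .*? stops at the FIRST "Sentence" after each "dependencies):".
lazy = re.findall(r"dependencies\):(.*?)Sentence", text, flags=re.DOTALL)

print(len(greedy))  # one oversized match
print(len(lazy))    # one match per dependency block
print(lazy)
```

With a single `Sentence` in the input, both patterns return the same result, which would explain why the pattern appeared to work on a short sample.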

score 0 · Answer 1 · answered Jul 05 '18 at 10:16

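Use re.findall(r"(?<=dependencies\):).*?(?=Sentence)", s, flags=re.DOTALL) with a lookbehind and a lookahead. The lazy quantifier .*? stops at the first Sentence after dependencies): instead of the last one.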

Demo:

import re

s = """ Dependency Parse (enhanced plus plus dependencies):
root(ROOT-0, imply-5)
dobj(imply-5, what-1)
aux(imply-5, does-2)
det(man-4, the-3)
nsubj(imply-5, man-4)
advmod(mentions-8, when-6)
nsubj(mentions-8, he-7)
advcl(imply-5, mentions-8)
det(papers-10, the-9)
dobj(mentions-8, papers-10)
nsubj(written-13, he-11)
aux(written-13, has-12)
acl:relcl(papers-10, written-13)

Sentence #1 (10 tokens):"""

m = re.findall(r"(?<=dependencies\):).*?(?=Sentence)", s, flags=re.DOTALL)
print(m)

Output:

['\nroot(ROOT-0, imply-5)\ndobj(imply-5, what-1)\naux(imply-5, does-2)\ndet(man-4, the-3)\nnsubj(imply-5, man-4)\nadvmod(mentions-8, when-6)\nnsubj(mentions-8, he-7)\nadvcl(imply-5, mentions-8)\ndet(papers-10, the-9)\ndobj(mentions-8, papers-10)\nnsubj(written-13, he-11)\naux(written-13, has-12)\nacl:relcl(papers-10, written-13)\n\n']

With `findall`, you only need lookarounds if you need overlapping matches. When the pattern contains a capturing group, `re.findall` returns only the *captured* substrings, so there is no point in using `(?<=` and `(?=` here. — Wiktor Stribiżew, Jul 05 '18 at 10:22
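
As a rough sketch of that comment, the capturing-group form and the lookaround form can be compared side by side. The shortened sample `s` and the final step that splits each block into lines are illustrative assumptions, not part of the original answer.

```python
import re

# Shortened version of the sample from the question.
s = """Dependency Parse (enhanced plus plus dependencies):
root(ROOT-0, imply-5)
dobj(imply-5, what-1)

Sentence #1 (10 tokens):"""

# Capturing group: findall returns only what the group captured.
with_group = re.findall(r"dependencies\):(.*?)Sentence", s, flags=re.DOTALL)

# Lookarounds: no group, so findall returns the whole (already trimmed) match.
with_lookarounds = re.findall(
    r"(?<=dependencies\):).*?(?=Sentence)", s, flags=re.DOTALL
)

print(with_group == with_lookarounds)

# Optional post-processing (an assumption about what is wanted downstream):
# drop the surrounding newlines and split each block into one relation per line.
relations = [block.strip().splitlines() for block in with_group]
print(relations)
```

On this input both patterns give the same list, so the simpler capturing-group version is enough unless overlapping matches are required.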
