
I have the following regular expression

(?<=<TEXT>).*?(?=</TEXT>)

which is supposed to find anything between <TEXT> and </TEXT>.

When I paste my string into http://pythex.org/ the pattern works, but the following Python code does not find anything:

import re
re.findall(r'(?<=<TEXT>).*?(?=</TEXT>)', text)

where text contains exactly what I pasted into pythex (I copied the variable's value from the debugger). Is there something special I need to pay attention to?

Some additional output

>>> pattern = re.compile(r"(?<=<TEXT>).*?(?=</TEXT>)")
>>> print(pattern)
re.compile('(?<=<TEXT>).*?(?=</TEXT>)')
>>> re.DOTALL
16
>>> pattern.findall(text)
[]
FooBar

2 Answers


I get the "correct" output with

re.findall(r'(?<=<TEXT>).*?(?=</TEXT>)', text, re.DOTALL)

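I had assumed that the default flags in re were the same as on pythex, which apparently they are not. Without re.DOTALL, the . in the pattern does not match newline characters, so content that spans several lines between <TEXT> and </TEXT> is never matched.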
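A minimal sketch of the difference, assuming the content between the tags contains a line break (the sample string below is made up for illustration and is not the original input):

```python
import re

# Hypothetical sample: the content between the tags spans two lines.
text = "<TEXT>first line\nsecond line</TEXT>"

pattern = r"(?<=<TEXT>).*?(?=</TEXT>)"

# By default, '.' matches any character except a newline,
# so no match can cross the line break.
default_matches = re.findall(pattern, text)

# With re.DOTALL, '.' also matches newlines.
dotall_matches = re.findall(pattern, text, re.DOTALL)

print(default_matches)  # []
print(dotall_matches)   # ['first line\nsecond line']
```

If the input really is multi-line, this would explain why the same pattern returns an empty list without the flag.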

FooBar
  • Dropping the ? after the .* makes the match greedy, so it includes everything from the first <TEXT> to the last </TEXT>. That said, it works for me with `re.findall(r'(?<=<TEXT>).*?(?=</TEXT>)', text, re.DOTALL)` – F1Rumors Feb 02 '16 at 17:11
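
To illustrate the greedy versus non-greedy behaviour mentioned in the comment, here is a small sketch. The input string is a hypothetical example with two <TEXT> blocks, chosen only to make the difference visible:

```python
import re

# Hypothetical input containing two separate <TEXT> blocks.
text = "<TEXT>one</TEXT> filler <TEXT>two</TEXT>"

# Non-greedy: '.*?' stops at the nearest closing tag, one match per block.
lazy = re.findall(r"(?<=<TEXT>).*?(?=</TEXT>)", text, re.DOTALL)

# Greedy: '.*' runs on to the last closing tag, giving one long match.
greedy = re.findall(r"(?<=<TEXT>).*(?=</TEXT>)", text, re.DOTALL)

print(lazy)    # ['one', 'two']
print(greedy)  # ['one</TEXT> filler <TEXT>two']
```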

It looks like you really ought to be considering a token parser rather than regular expressions. Is this XML or HTML input? If so, you might want to look at this question and its top answer: How Do I Parse XML in Python
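
As a rough sketch of the parser approach, assuming the input is well-formed XML with a single root element (the <DOC> wrapper and the sample content are assumptions, not taken from the question), the standard library's xml.etree.ElementTree can collect the text of every <TEXT> element:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XML document; the <DOC> root is an assumption.
doc = "<DOC><TEXT>first line\nsecond line</TEXT><TEXT>another</TEXT></DOC>"

root = ET.fromstring(doc)

# iter("TEXT") walks the whole tree and yields every <TEXT> element.
texts = [element.text for element in root.iter("TEXT")]

print(texts)  # ['first line\nsecond line', 'another']
```

This handles line breaks inside the element without any regex flags, but it will raise a ParseError on input that is not well-formed XML, so HTML or fragmentary markup would need a different parser.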

F1Rumors