
I am new to regex and just testing it out. After looking at some examples, my problem is that my regex matches almost the whole line instead of just the text between the tags.

>>> re.findall(r'<i>(.*)</i>', 'test <i>abc</i> <i>def</i>')
['abc</i> <i>def']

Why doesn't it match only the text between the tags, giving me abc and def?

user3079411
  • For testing it out, this is fine. If you really want to parse HTML with regular expression, please see this post: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Hyperboreus Dec 08 '13 at 08:21

1 Answer


You are using .*, which is greedy: it matches as much text as possible, so it runs through to the last </i> on the line. Add ? after the * to make it non-greedy.

>>> re.findall(r'<i>(.*?)</i>', 'test <i>abc</i> <i>def</i>')
['abc', 'def']

From the re documentation:

The *, +, and ? qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.
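To see the difference side by side, here is a small sketch that runs the greedy and non-greedy patterns on the string from the question, and on the '<H1>title</H1>' string from the documentation quote. The third pattern, which uses a negated character class ([^<]*), is not part of the answer above; it is shown only as a common alternative, and it assumes the text between the tags never contains a < character. The variable names are just for illustration.

```python
import re

text = 'test <i>abc</i> <i>def</i>'

# Greedy: .* grabs as much as it can, then backtracks to the LAST </i>.
greedy = re.findall(r'<i>(.*)</i>', text)

# Non-greedy: .*? grabs as little as it can, stopping at the FIRST </i>.
lazy = re.findall(r'<i>(.*?)</i>', text)

# Alternative (assumption: no '<' inside the tag body): match anything
# except '<', so the group cannot run past the closing tag.
negated = re.findall(r'<i>([^<]*)</i>', text)

print(greedy)   # ['abc</i> <i>def']
print(lazy)     # ['abc', 'def']
print(negated)  # ['abc', 'def']

# The example from the documentation quote.
doc_example = '<H1>title</H1>'
doc_greedy = re.match(r'<.*>', doc_example).group()
doc_lazy = re.match(r'<.*?>', doc_example).group()

print(doc_greedy)  # <H1>title</H1>
print(doc_lazy)    # <H1>
```

For quick tests like this, either the non-greedy or the negated-class form works. For real HTML, a parser is the safer choice, as the comment on the question points out.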

hwnd
  • @fscore [Regular Expressions Tutorial](http://www.regular-expressions.info/tutorial.html). Good luck. – Steve P. Dec 08 '13 at 08:49