2

Using Python 2.7.3 on Linux. Here is a shell session verbatim.

>>> f = open("feed.xml")
>>> text = f.read()
>>> import re
>>> regexp1 = re.compile(r'</?item>')
>>> regexp2 = re.compile(r'<item>.*</item>')
>>> regexp1.findall(text)
['<item>', '</item>', '<item>', '</item>', '<item>', '</item>', '<item>', '</item>']
>>> regexp2.findall(text)
[]

Is this a bug, or is there something I'm not understanding about Python regular expressions?

Jangler

2 Answers

5

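By default, '.' does not match a newline, so '.*' cannot cross a line break. If each <item> and its </item> sit on different lines, the pattern never matches. Try compiling the pattern with re.DOTALL: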

regexp2 = re.compile(r'<item>.*</item>', re.DOTALL)
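
To see the difference, here is a small sketch that runs the same pattern with and without the flag. The sample text is a made-up stand-in for feed.xml (the real file is not shown in the question), and the non-greedy variant '.*?' is an extra suggestion rather than something the question asked for.

```python
import re

# Hypothetical stand-in for feed.xml; the real file's contents are not shown.
text = "<item>\n<title>a</title>\n</item>\n<item>\n<title>b</title>\n</item>\n"

# Without DOTALL, '.' stops at the first newline, so nothing matches.
no_flag = re.compile(r'<item>.*</item>')

# With DOTALL, '.' also matches newlines. The greedy '.*' then runs from
# the first <item> all the way to the last </item>.
dotall_greedy = re.compile(r'<item>.*</item>', re.DOTALL)

# The non-greedy '.*?' stops at the nearest </item>, giving one match per item.
dotall_lazy = re.compile(r'<item>.*?</item>', re.DOTALL)

no_flag_matches = no_flag.findall(text)
greedy_matches = dotall_greedy.findall(text)
lazy_matches = dotall_lazy.findall(text)

print(len(no_flag_matches))  # 0
print(len(greedy_matches))   # 1
print(len(lazy_matches))     # 2
```

If you want each item separately, the non-greedy form is probably what you are after; with the greedy form you get a single match covering every item.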
chepner
0

Here is the best answer to this question: don't use regular expressions to parse non-regular languages such as XML. It drove one Stack Overflow user insane. Another relevant link.
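
For a quick hack, a parser from the standard library may be about as short as the regex. The following is only a sketch: the feed layout (an rss/channel wrapper with a title inside each item) is an assumption, since the actual feed.xml is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed; the structure of the real feed.xml is assumed, not known.
text = (
    "<rss><channel>"
    "<item><title>a</title></item>"
    "<item><title>b</title></item>"
    "</channel></rss>"
)

root = ET.fromstring(text)

# iter("item") walks the whole tree and yields every <item> element,
# no matter how the file is split across lines.
titles = [item.findtext("title") for item in root.iter("item")]

print(titles)  # ['a', 'b']
```

To read from the file instead of a string, ET.parse("feed.xml").getroot() should give the same kind of root element.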

Community
Claudiu
  • This doesn't address his misunderstanding of regular expressions, however. – chepner Jul 30 '12 at 15:43
  • A valid point, but I'm only using this code for a quick hack and thus don't want or need to learn any new APIs. – Jangler Jul 30 '12 at 15:48
  • I finally followed the link to the insane S-O user. I'd retract my downvote for that if I could :) – chepner Jul 30 '12 at 16:00
  • @chepner: made a trivial (whitespace only) edit so you can retract the downvote. – Fred Foo Jul 30 '12 at 16:50
  • @Jangler: quick hacks often become scripts that you rely on. if you learn the new API then you can do a quick hack with the new API – Claudiu Jul 30 '12 at 20:02