I have a file that has many xml-like elements such as this one:

<document docid=1>
Preliminary Report-International Algebraic Language
Perlis, A. J. & Samelson,K.
CACM December, 1958
</document>

I need to parse the docid and the text. What's a suitable regular expression for that?

I've tried this but it doesn't work:

collectionText = open('documents.txt').read()
docsPattern = r'<document docid=(\d+)>(.)*</document>'
docTuples = re.findall(docsPattern, collectionText)

EDIT: I've modified the pattern like this:

<document docid=(\d+)>(.*)</document>

Unfortunately, this matches the whole file as a single result rather than the individual document elements.

EDIT2: The correct implementation from Ahmad's and Acorn's answer is:

import re

# Read the whole file at once; the with-block closes it afterwards.
with open('documents.txt') as f:
    collectionText = f.read()
docsPattern = r'<document docid=(\d+)>(.*?)</document>'
docTuples = re.findall(docsPattern, collectionText, re.DOTALL)
MarthyM
siamii
  • XML and Regex are two words I hate hearing together. – mauris Nov 15 '11 at 02:42
  • @thephpdeveloper, in general, you're right. But if it's an XML-like format with a known structure, regular expressions might be the easiest solution. – svick Nov 15 '11 at 03:21

3 Answers

4

You need to use the DOTALL option with your regular expression so that it will match over multiple lines (by default . will not match newline characters).

Also note the comments regarding greediness in Ahmad's answer.

import re

text = '''<document docid=1>
Preliminary Report-International Algebraic Language
Perlis, A. J. & Samelson,K.
CACM December, 1958
</document>'''

pattern = r'<document docid=(\d+)>(.*?)</document>'
# print() with parentheses works on both Python 2 and Python 3
print(re.findall(pattern, text, re.DOTALL))
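To make the effect of the flag concrete, here is a small sketch that runs the same pattern with and without re.DOTALL. The shortened sample string is an assumption made up for illustration; it is not the asker's data.

```python
import re

# Made-up two-line document body, for illustration only.
text = "<document docid=1>\nline one\nline two\n</document>"
pattern = r'<document docid=(\d+)>(.*?)</document>'

# Without DOTALL, '.' cannot match '\n', so the body can never reach
# the closing tag and nothing is found.
without_flag = re.findall(pattern, text)

# With DOTALL, '.' also matches '\n', so the body may span several lines.
with_flag = re.findall(pattern, text, re.DOTALL)

print(without_flag)  # []
print(with_flag)     # [('1', '\nline one\nline two\n')]
```

The captured text keeps its leading and trailing newlines; calling .strip() on the second tuple element removes them if they are not wanted.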

In general, regular expressions are not suitable for parsing XML/HTML.

See:

RegEx match open tags except XHTML self-contained tags and http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

You want to use a parser like lxml.
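As a rough illustration of the parser approach, the sketch below uses the standard library's html.parser rather than lxml, purely to avoid a third-party dependency. It is a swapped-in technique, not the lxml API. The class name DocumentCollector and its attributes are hypothetical names chosen for this example. A lenient HTML parser is assumed to be appropriate here because the sample is not well-formed XML (the attribute value is unquoted and the & is unescaped).

```python
from html.parser import HTMLParser


class DocumentCollector(HTMLParser):
    """Hypothetical helper: collects (docid, text) pairs from <document> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.docs = []        # finished (docid, text) pairs
        self._docid = None    # docid of the element currently being read
        self._chunks = []     # text pieces seen inside that element

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "document":
            self._docid = dict(attrs).get("docid")
            self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears inside a <document> element.
        if self._docid is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "document" and self._docid is not None:
            self.docs.append((self._docid, "".join(self._chunks).strip()))
            self._docid = None


sample = """<document docid=1>
Preliminary Report-International Algebraic Language
Perlis, A. J. & Samelson,K.
CACM December, 1958
</document>"""

collector = DocumentCollector()
collector.feed(sample)
collector.close()
print(collector.docs)
```

Unlike the regular expression, this does not depend on the exact spelling of the opening tag, so extra whitespace or additional attributes should not break it.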

Acorn
4

Your pattern is greedy, so if you have multiple <document> elements it will end up matching all of them.

You can make it non-greedy by using .*?, which means "match zero or more characters, as few as possible." The updated pattern is:

<document docid=(\d+)>(.*?)</document>
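A minimal sketch of the difference, using two made-up documents (the sample text is an assumption for illustration). re.DOTALL is passed in both calls so that . can cross line breaks and only the greediness differs.

```python
import re

# Two made-up documents, for illustration only.
text = (
    "<document docid=1>\nfirst\n</document>\n"
    "<document docid=2>\nsecond\n</document>"
)

# Greedy: .* runs to the LAST </document>, swallowing both elements.
greedy = re.findall(r'<document docid=(\d+)>(.*)</document>', text, re.DOTALL)

# Non-greedy: .*? stops at the FIRST </document> after each opening tag.
lazy = re.findall(r'<document docid=(\d+)>(.*?)</document>', text, re.DOTALL)

print(len(greedy))                   # 1
print(len(lazy))                     # 2
print([docid for docid, _ in lazy])  # ['1', '2']
```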
Ahmad Mageed
  • Well spotted. This doesn't solve the problem of the expression needing to match over multiple lines though. – Acorn Nov 15 '11 at 03:16
  • @Acorn yeah I missed that, thinking the OP had that covered because it "matches the whole document." Good point though :) – Ahmad Mageed Nov 15 '11 at 03:21
1

Just FYI, this pattern seems to work for "xml-like" structures in .NET:

<([^<>]+)>([^<>]+)<(\/[^<>]+)>
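The answer mentions .NET only; as a hedged sketch, the same pattern also appears to behave this way with Python's re module. The sample string is made up for illustration. No DOTALL flag is needed, because a negated character class such as [^<>] matches newlines by default.

```python
import re

# Made-up sample, for illustration only.
text = "<document docid=1>\nsome text\n</document>"
generic = r'<([^<>]+)>([^<>]+)<(\/[^<>]+)>'

matches = re.findall(generic, text)
print(matches)
# [('document docid=1', '\nsome text\n', '/document')]
```

Note that the first group captures the whole opening tag including its attributes, so the docid would still have to be extracted separately. The pattern also does not check that the closing tag matches the opening one, and it fails on any body text that contains < or >.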
Mathias Müller
user2860427