Parse XML-like document with regex

Question

I have a file that has many xml-like elements such as this one:

<document docid=1>
Preliminary Report-International Algebraic Language
Perlis, A. J. & Samelson,K.
CACM December, 1958
</document>

I need to parse the docid and the text. What's a suitable regular expression for that?

I've tried this but it doesn't work:

collectionText = open('documents.txt').read()
docsPattern = r'<document docid=(\d+)>(.)*</document>'
docTuples = re.findall(docsPattern, collectionText)

EDIT: I've modified the pattern like this:

<document docid=(\d+)>(.*)</document>

Unfortunately, this matches the whole document rather than the individual document elements.

EDIT2: The correct implementation from Ahmad's and Acorn's answer is:

import re

with open('documents.txt') as f:
    collectionText = f.read()
docsPattern = r'<document docid=(\d+)>(.*?)</document>'
docTuples = re.findall(docsPattern, collectionText, re.DOTALL)

@thephpdeveloper, in general, you're right. But if it's XML-like format with known structure, regular expressions might be the easiest solution. — svick, Nov 15 '11 at 03:21

score 4 · Answer 1 · edited May 23 '17 at 11:53

You need to use the DOTALL option with your regular expression so that it will match over multiple lines (by default . will not match newline characters).

Also note the comments regarding greediness in Ahmad's answer.

import re

text = '''<document docid=1>
Preliminary Report-International Algebraic Language
Perlis, A. J. & Samelson,K.
CACM December, 1958
</document>'''

pattern = r'<document docid=(\d+)>(.*?)</document>'
print(re.findall(pattern, text, re.DOTALL))
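
To make the effect of DOTALL concrete, here is a minimal sketch that runs the same pattern with and without the flag. The sample string and the variable names are made up for illustration; they are not from the question's file.

```python
import re

# Hypothetical two-line document body, just for the comparison.
sample = "<document docid=1>\nline one\nline two\n</document>"
pattern = r'<document docid=(\d+)>(.*?)</document>'

# Without DOTALL, "." cannot cross the newline after the opening tag,
# so the pattern never reaches </document> and nothing is found.
without_flag = re.findall(pattern, sample)

# With DOTALL, "." also matches newlines and the body is captured.
with_flag = re.findall(pattern, sample, re.DOTALL)

print(without_flag)
print(with_flag)
```

The captured text keeps its leading and trailing newlines, so you may want to call .strip() on it.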

In general, regular expressions are not suitable for parsing XML/HTML.

See:

RegEx match open tags except XHTML self-contained tags and http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

You want to use a parser like lxml.
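
As a rough illustration of the parser route, here is a sketch that uses the standard library's html.parser instead of lxml (a deliberate swap, so that nothing needs to be installed). It assumes the file has the shape shown in the question: unquoted docid attributes and no nested document elements. The class name DocumentCollector and the sample text are hypothetical.

```python
from html.parser import HTMLParser


class DocumentCollector(HTMLParser):
    """Collects (docid, text) pairs from <document docid=N> elements."""

    def __init__(self):
        super().__init__()
        self.documents = []
        self._docid = None   # docid of the element being read, if any
        self._chunks = []    # text pieces seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "document":
            self._docid = dict(attrs).get("docid")
            self._chunks = []

    def handle_data(self, data):
        # Only keep text that sits inside a <document> element.
        if self._docid is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "document" and self._docid is not None:
            body = "".join(self._chunks).strip()
            self.documents.append((self._docid, body))
            self._docid = None


sample = """<document docid=1>
Preliminary Report-International Algebraic Language
Perlis, A. J. & Samelson,K.
CACM December, 1958
</document>
<document docid=2>
Second entry
</document>"""

collector = DocumentCollector()
collector.feed(sample)
collector.close()
documents = collector.documents
print(documents)
```

An HTML parser is used here because the sample is not well-formed XML (unquoted attribute value, bare &), which a strict XML parser would reject.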

this isn't XML, just similar. I only need the regex for this one file — siamii, Nov 15 '11 at 02:49

score 4 · Accepted Answer · answered Nov 15 '11 at 03:06


Your pattern is greedy, so if you have multiple <document> elements it will end up matching all of them.

You can make it non-greedy by using .*?, which means "match zero or more characters, as few as possible." The updated pattern is:

<document docid=(\d+)>(.*?)</document>
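
Here is a small sketch of the difference on two elements. The sample text is made up, and re.DOTALL is passed in both calls so that . can cross line breaks; only the quantifier changes.

```python
import re

# Hypothetical input with two <document> elements.
two_docs = (
    "<document docid=1>\nfirst\n</document>\n"
    "<document docid=2>\nsecond\n</document>"
)

# Greedy: .* runs to the LAST </document>, swallowing the second element.
greedy = re.findall(r'<document docid=(\d+)>(.*)</document>',
                    two_docs, re.DOTALL)

# Non-greedy: .*? stops at the FIRST </document> after each opening tag.
lazy = re.findall(r'<document docid=(\d+)>(.*?)</document>',
                  two_docs, re.DOTALL)

print(len(greedy))
print(lazy)
```

The greedy version returns a single tuple whose text contains the second opening tag, while the non-greedy version returns one tuple per element.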


Ahmad Mageed


Well spotted. This doesn't solve the problem of the expression needing to match over multiple lines though. – Acorn Nov 15 '11 at 03:16
@Acorn yeah I missed that, thinking the OP had that covered because it "matches the whole document." Good point though :) – Ahmad Mageed Nov 15 '11 at 03:21

score 1 · Answer 3 · edited Oct 25 '16 at 20:32


FYI, this pattern seems to work for "XML-like" structures in .NET:

<([^<>]+)>([^<>]+)<(\/[^<>]+)>
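
For comparison, here is a quick sketch of the same pattern tried in Python's re module rather than .NET (that swap is my assumption; the answer only mentions .NET). The sample text is shortened from the question.

```python
import re

text = "<document docid=1>\nPreliminary Report\n</document>"

# Group 1: everything inside the opening tag (name AND attributes).
# Group 2: the text between the tags ([^<>] also matches newlines).
# Group 3: the closing tag, including its slash.
generic = r'<([^<>]+)>([^<>]+)<(\/[^<>]+)>'

matches = re.findall(generic, text)
print(matches)
```

Note that this pattern does not check that the closing tag matches the opening one, and the docid stays embedded in group 1, so it would still need to be pulled out separately.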


Mathias Müller


answered Jan 21 '14 at 22:22

user2860427

