I have a file that contains many XML-like elements such as this one:
<document docid=1>
Preliminary Report-International Algebraic Language
Perlis, A. J. & Samelson,K.
CACM December, 1958
</document>
I need to parse the docid and the text. What's a suitable regular expression for that?
I've tried this but it doesn't work:
collectionText = open('documents.txt').read()
docsPattern = r'<document docid=(\d+)>(.)*</document>'
docTuples = re.findall(docsPattern, collectionText)
EDIT: I've modified the pattern like this:
<document docid=(\d+)>(.*)</document>
Unfortunately, this matches the whole file as one big match rather than the individual document elements.
EDIT2: The correct implementation, based on Ahmad's and Acorn's answers, is:
import re
with open('documents.txt') as f:
    collectionText = f.read()
# Non-greedy (.*?) stops at the first </document>; re.DOTALL lets . match newlines
docsPattern = r'<document docid=(\d+)>(.*?)</document>'
docTuples = re.findall(docsPattern, collectionText, re.DOTALL)
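For anyone comparing the two patterns, here is a small self-contained sketch of the difference between the greedy and non-greedy versions. The inline sample string stands in for documents.txt, and the second document entry is made-up placeholder text, not part of my real data.

```python
import re

# Hypothetical sample standing in for the contents of documents.txt.
# The second <document> element is invented purely for illustration.
sample = """<document docid=1>
Preliminary Report-International Algebraic Language
Perlis, A. J. & Samelson,K.
CACM December, 1958
</document>
<document docid=2>
Second entry (made-up placeholder text)
</document>
"""

greedy = r'<document docid=(\d+)>(.*)</document>'
lazy = r'<document docid=(\d+)>(.*?)</document>'

# With re.DOTALL, the greedy .* runs to the LAST </document> in the text,
# so both elements are swallowed into a single match.
greedy_matches = re.findall(greedy, sample, re.DOTALL)

# The non-greedy .*? stops at the FIRST </document> after each opening tag.
lazy_matches = re.findall(lazy, sample, re.DOTALL)

# Convert the id to int and trim the newlines surrounding each text body.
docs = [(int(docid), text.strip()) for docid, text in lazy_matches]

print(len(greedy_matches))  # 1
print(len(lazy_matches))    # 2
for docid, text in docs:
    print(docid, repr(text))
```

Since the elements are only XML-like (the docid attribute is unquoted), a real XML parser would probably reject the file as it stands, which is why a regex seems like a reasonable fit here.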