Process xml input with python regex

Question

I need to clean up an xml file before I can process it. The file has junk at the start and end, and then junk in between elements. Here is an example file:

junkjunkjunkjunk<root>
\par junkjunkjunkjunkjunk<level1>useful info to keep</level1>
</root>
junkjunkjunkjunk

How do I use regex to cut out (with replace?) the start and end junk, and then the middle junk? The middle junk always starts with "\par ...".

related: [Can I get lxml to ignore non-XML content before and after the root tag?](http://stackoverflow.com/q/15208543/4279) — jfs, Jun 24 '13 at 21:14
http://stackoverflow.com/questions/8577060/why-is-it-such-a-bad-idea-to-parse-xml-with-regex — Jan Matějka, Jun 24 '13 at 21:30

robjohncox · Accepted Answer · 2013-06-24T21:21:48.827

2

The following statements should remove the junk (assuming your document is stored in a variable called xml):

import re

xml = re.sub(r'.*<root>', '<root>', xml, flags=re.DOTALL)    # Remove leading junk
xml = re.sub(r'\\par[^<]*<', '<', xml)                       # Middle junk
xml = re.sub(r'</root>.*', '</root>', xml, flags=re.DOTALL)  # Trailing junk

Note that this assumes you know the name of the root element (and in this case, it's called root), otherwise you may need to adjust this slightly.

edited Jun 24 '13 at 21:21

answered Jun 24 '13 at 15:08

robjohncox

3,639
3
25
51

2

re.DOTALL flag is missing. – jfs Jun 24 '13 at 21:16

Process xml input with python regex

1 Answers1