
I need to clean up an XML file before I can process it. The file has junk before the root element, junk after it, and junk in between elements. Here is an example file:

junkjunkjunkjunk<root>
\par junkjunkjunkjunkjunk<level1>useful info to keep</level1>
</root>
junkjunkjunkjunk

How can I use a regex (with a replace?) to cut out the leading and trailing junk, and then the junk in the middle? The middle junk always starts with "\par ...".

user2104778
  • related: [Can I get lxml to ignore non-XML content before and after the root tag?](http://stackoverflow.com/q/15208543/4279) – jfs Jun 24 '13 at 21:14
  • http://stackoverflow.com/questions/8577060/why-is-it-such-a-bad-idea-to-parse-xml-with-regex – Jan Matějka Jun 24 '13 at 21:30

1 Answer


The following statements should remove the junk (assuming your document is stored in a variable called xml):

import re

xml = re.sub(r'.*<root>', '<root>', xml, flags=re.DOTALL)    # Remove leading junk
xml = re.sub(r'\\par[^<]*<', '<', xml)                       # Middle junk
xml = re.sub(r'</root>.*', '</root>', xml, flags=re.DOTALL)  # Trailing junk

Note that this assumes you know the name of the root element (here it is called root). If your root element has a different name, adjust the first and last patterns to match it.
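If the root element name varies, the same idea can be wrapped in a small function. This is only a sketch, not a tested solution for every file: strip_junk and its root_tag parameter are names made up for illustration, and it assumes the root tag has no attributes and that "\par" never appears inside text you want to keep. It is a slight variation on the statements above: re.search picks out the span from the opening root tag to the closing one, and the middle junk is then removed with the same \par pattern.

```python
import re
import xml.etree.ElementTree as ET


def strip_junk(text, root_tag="root"):
    """Hypothetical helper: keep only the root element and drop the \\par junk."""
    tag = re.escape(root_tag)
    # Keep everything from the first opening root tag to the last closing one.
    # Assumption: the root tag is written without attributes.
    match = re.search(r"<%s>.*</%s>" % (tag, tag), text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no <%s> element found" % root_tag)
    # Remove each run of middle junk: '\par' up to (not including) the next '<'.
    return re.sub(r"\\par[^<]*", "", match.group(0))


# The example file from the question.
sample = (
    "junkjunkjunkjunk<root>\n"
    "\\par junkjunkjunkjunkjunk<level1>useful info to keep</level1>\n"
    "</root>\n"
    "junkjunkjunkjunk\n"
)

cleaned = strip_junk(sample)
root = ET.fromstring(cleaned)    # the cleaned text now parses as XML
print(root.find("level1").text)  # useful info to keep
```

Parsing the result with ElementTree right away is a cheap check that nothing outside the root element was left behind.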

robjohncox