How to parse out xml from noisy file using python

Question

I have a file which contains a bunch of logging information, including XML. I'd like to parse out the XML portion into a string object so I can then run some XPath queries on it to ensure the existence of certain information on the 'data' element.

File to parse:

Requesting event notifications... 
Receiving command objects... 
<?xml version="1.0" encoding="UTF-8"?><Root xmlns="http://schemas.com/service" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><data id="123" interface="2017.1" implementation="2016.122-SNAPSHOT" Version="2016.1.2700-SNAPSHOT"></data></Root>
All information has been collected 
Command execution successful...

Python:

import re

with open('./output.out', 'r') as outFile:
    data = outFile.read().replace('\n','')

regex = re.escape("<.*?>.*?<\/Root>");
p = re.compile(regex)
m = p.match(data)

if m:
    print(m.group())
else:
    print('No match')

Output:

No match

What am I doing wrong? How can I accomplish my goal? Any help would be much appreciated.

score 3 · Accepted Answer · answered Aug 28 '17 at 00:45


Thou shalt never use regular expressions for parsing XML/HTML. There is BeautifulSoup for this daunting task.

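import bs4

# Read the whole log file, noise included, and close it when done.
with open("output.out") as log_file:
    soup = bs4.BeautifulSoup(log_file.read(), "lxml")

# The "lxml" HTML parser lowercases names, so search for 'root', not 'Root'.
roots = soup.find_all('root')
#[<root xmlns="http://schemas.com/service" 
# xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
# <data id="123" implementation="2016.122-SNAPSHOT" interface="2017.1" 
# version="2016.1.2700-SNAPSHOT"></data></root>]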

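roots[0] is the parsed Root element (a bs4 Tag). You can search it further, for example with roots[0].find('data'). Note that this parser lowercases tag and attribute names, as the output above shows.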
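If you need namespace-aware, XPath-style queries that keep the original case of names such as Root and Version, here is a minimal sketch using only the standard library. It is not part of the approach above, and it assumes the XML sits on a single line that starts with the `<?xml` declaration, as in the sample log. The `extract_xml` helper, the `LOG` sample string, and the `svc` prefix are names made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Sample mirroring the log in the question; in practice, read it from output.out.
LOG = """Requesting event notifications...
Receiving command objects...
<?xml version="1.0" encoding="UTF-8"?><Root xmlns="http://schemas.com/service" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><data id="123" interface="2017.1" implementation="2016.122-SNAPSHOT" Version="2016.1.2700-SNAPSHOT"></data></Root>
All information has been collected
Command execution successful...
"""


def extract_xml(text):
    """Return the first line that starts with an XML declaration, or None.

    Hypothetical helper: assumes the whole document is on one line.
    """
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("<?xml"):
            return line
    return None


xml_text = extract_xml(LOG)
if xml_text is None:
    raise ValueError("no XML declaration found in the log")

# Pass bytes so the parser can honour the encoding named in the declaration.
root = ET.fromstring(xml_text.encode("utf-8"))

# The document declares a default namespace, so map it to a prefix of our
# choosing ('svc' is arbitrary) and use that prefix in the path expression.
ns = {"svc": "http://schemas.com/service"}
data = root.find("svc:data", ns)

if data is None:
    print("No data element found")
else:
    print(data.get("id"))
    print(data.get("Version"))
```

ElementTree supports only a limited subset of XPath; for full XPath, the same extracted string could be handed to lxml instead.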

answered Aug 28 '17 at 00:45

DYZ


Is `BeautifulSoup` compatible with Python 2.7? It's OK if it's not. I like the title of the link you posted :) – barthelonafan Aug 28 '17 at 00:51

Yes it is 2.7 compatible. – DYZ Aug 28 '17 at 00:55
