Funky XML tag parse

Question

I am trying to parse Office Open XML. Parsing with lxml in Python is going fine, but the data I need to grab is located within a tag whose structure looks a bit funky to me.

<w:sdt Content> Dataaaaa </w:sdt>

Normally this would be fine, as there are many sdt tags. But the "Content" part is throwing me off. This code:

for element in tree.iter('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}sdt'):
    print(element.tag, element.attrib)

prints the following for every matching tag, so I can't tell which is which:

{http://schemas.openxmlformats.org/wordprocessingml/2006/main}sdt {}

Also, don't worry about the namespacing, as I have that figured out. I am specifically just trying to access the tag above and the data within. :)

That's not funky XML: it's not any kind of XML. Where does it come from? No XML-based tools are going to be able to handle this. — Michael Kay, May 31 '18 at 00:11
Remove `Content` or change it to `Content=""` to make that XML well-formed. (You'll have to remove it to make it valid OOXML.) See duplicate link for further advice on parsing markup that resembles XML (but isn't). — kjhughes, May 31 '18 at 01:11
@MichaelKay It comes from the xml in a Microsoft Word document — Hysii, May 31 '18 at 12:56
My plan is automation, so I cannot manually change the tag every time. The link, however, helps solve my problem with the xmlstarlet command-line utility. Thanks :) — Hysii, May 31 '18 at 13:29
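
For anyone who wants to automate the repair kjhughes describes without leaving Python, here is a minimal sketch. It assumes the markup really does contain a bare `Content` word inside the `w:sdt` start tag, exactly as written in the question. The sample string, the `repair` and `marked_texts` helpers, and the regular expression are all hypothetical names made up for illustration. It uses the standard-library `xml.etree.ElementTree` so that it runs on its own; the same `iter()` call with the namespace in braces should work on an lxml tree as well.

```python
import re
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical sample: the tag from the question, wrapped in a root element
# that declares the w: prefix, next to an ordinary w:sdt element.
raw = (
    '<w:document xmlns:w="%s">'
    '<w:sdt Content> Dataaaaa </w:sdt>'
    '<w:sdt> other </w:sdt>'
    '</w:document>' % W_NS
)

# Finds a bare "Content" word (no "=value" after it) inside a <w:sdt ...> start tag.
# The \b after "sdt" keeps element names such as w:sdtContent from matching.
BARE_ATTR = re.compile(r'(<w:sdt\b[^>]*?\s)Content(?=[\s/>])(?!\s*=)')


def repair(text):
    """Turn the bare word into Content="" so the markup becomes well-formed."""
    return BARE_ATTR.sub(r'\1Content=""', text)


def marked_texts(text):
    """Return the stripped text of every w:sdt element carrying the Content attribute."""
    root = ET.fromstring(repair(text))
    return [
        (element.text or "").strip()
        for element in root.iter("{%s}sdt" % W_NS)
        if "Content" in element.attrib
    ]


if __name__ == "__main__":
    print(marked_texts(raw))
```

Once the attribute is present, `element.attrib` is no longer empty for that tag, which is what lets the loop from the question tell it apart from the other `sdt` elements. A regex pre-pass like this is only a stopgap for this one known defect, not a general way to parse malformed markup.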
