I am trying to parse Office Open XML. Parsing with lxml in Python is going fine, but the data I need to grab is located within a tag whose structure looks a bit funky to me.

<w:sdt Content> Dataaaaa </w:sdt>

Normally this would be fine, as there are many sdt tags. But the "Content" part is throwing me off. This code:

for element in tree.iter('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}sdt'):
    print(element.tag, element.attrib)

returns the line below for multiple tags, so I don't know which is which:

{http://schemas.openxmlformats.org/wordprocessingml/2006/main}sdt {}

Also, don't worry about the namespacing, as I have this figured out. I am specifically just trying to access the tag above and the data within. :)
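For telling the sdt elements apart, here is a minimal sketch. It assumes the document follows the usual WordprocessingML layout, where each w:sdt holds a w:sdtPr child (optionally with w:tag and w:alias) and a w:sdtContent child; the "Content" seen above may simply be that w:sdtContent element. The SAMPLE document and the helper name describe_sdts are hypothetical, made up for illustration.

```python
try:
    from lxml import etree
except ImportError:
    # Fallback so the sketch still runs without lxml; the calls used
    # below (fromstring, iter, find, get) behave the same in both.
    import xml.etree.ElementTree as etree

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical sample, not taken from a real Word file.
SAMPLE = b"""<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:sdt>
      <w:sdtPr><w:alias w:val="Title"/><w:tag w:val="title"/></w:sdtPr>
      <w:sdtContent><w:p><w:r><w:t>Dataaaaa</w:t></w:r></w:p></w:sdtContent>
    </w:sdt>
    <w:sdt>
      <w:sdtPr><w:tag w:val="author"/></w:sdtPr>
      <w:sdtContent><w:p><w:r><w:t>Hy</w:t></w:r><w:r><w:t>sii</w:t></w:r></w:p></w:sdtContent>
    </w:sdt>
  </w:body>
</w:document>"""


def describe_sdts(tree):
    """Return one dict per w:sdt element: its tag, alias, and joined text."""
    results = []
    for sdt in tree.iter("{%s}sdt" % W):
        tag = sdt.find("w:sdtPr/w:tag", NS)
        alias = sdt.find("w:sdtPr/w:alias", NS)
        content = sdt.find("w:sdtContent", NS)
        # Text lives in w:t runs; join them in document order.
        text = ""
        if content is not None:
            text = "".join(t.text or "" for t in content.iter("{%s}t" % W))
        results.append({
            "tag": None if tag is None else tag.get("{%s}val" % W),
            "alias": None if alias is None else alias.get("{%s}val" % W),
            "text": text,
        })
    return results


if __name__ == "__main__":
    for info in describe_sdts(etree.fromstring(SAMPLE)):
        print(info)
```

The identifying values sit in child elements rather than in attributes of w:sdt itself, which would explain why element.attrib prints as an empty dict.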

Hysii
  • That's not funky XML: it's not any kind of XML. Where does it come from? No XML-based tools are going to be able to handle this. – Michael Kay May 31 '18 at 00:11
  • Remove `Content` or change it to `Content=""` to make that XML well-formed. (You'll have to remove it to make it valid OOXML.) See duplicate link for further advice on parsing markup that resembles XML (but isn't). – kjhughes May 31 '18 at 01:11
  • @MichaelKay It comes from the xml in a Microsoft Word document – Hysii May 31 '18 at 12:56
  • My plan is automation, so I cannot manually change the tag every time. The link, however, helps solve my problem with the xmlstarlet command-line utility. Thanks :) – Hysii May 31 '18 at 13:29
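
The repair suggested in the comments can also be automated as a preprocessing step before parsing. This is a minimal sketch, assuming the bare token really does appear literally as `<w:sdt Content>` in the raw text; the helper name repair_bare_attribute and the SAMPLE string are hypothetical. It rewrites the token to `Content=""` so the markup becomes well-formed (removing the token entirely would be the alternative if valid OOXML is required).

```python
import re

try:
    from lxml import etree
except ImportError:
    # Fallback so the sketch still runs without lxml.
    import xml.etree.ElementTree as etree

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Matches the opening tag with a bare, value-less "Content" token.
BARE_CONTENT = re.compile(r"<w:sdt\s+Content\s*>")


def repair_bare_attribute(xml_text):
    """Give the bare Content token an empty value so the text parses as XML."""
    return BARE_CONTENT.sub('<w:sdt Content="">', xml_text)


# Hypothetical input modelled on the snippet in the question.
SAMPLE = (
    '<w:body xmlns:w="%s">'
    "<w:sdt Content> Dataaaaa </w:sdt>"
    "</w:body>" % W
)

root = etree.fromstring(repair_bare_attribute(SAMPLE))
sdt = root.find("w:sdt", NS)

if __name__ == "__main__":
    print(sdt.text.strip(), dict(sdt.attrib))
```

A regular expression is only reasonable here because the pattern is one fixed token; it is not a general way to fix malformed markup.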
