0

I'm using the docx package to extract elements from a word document and want to save them in a specific XML format.

doc = docx.Document("sample.docx")
paras = doc.paragraphs 

sample.docx contain headings, standard text, images, hyperlinks, lists and tables.

When I print out the different styles in the document, it seems like I can easily extract the headings and standard text components. i.e. The following giving me styles such as Heading, Normal, Body Text, Heading 2, Spacer, List paragraph etc.

for p in paras:
   print(p.style.name)

Can someone shed some light on how I can extract the following components?

  • Images: How to extract images? I found a similar answer here.
  • Hyperlinks: How to know a paragraph has a link in it?
  • Lists: Some lists are extracted as List paragraph style while other lists have not been extracted
  • Tables: I found that for tables, need to extract doc.tables. But then how do I maintain order of the elements in the original doc?
kami
  • 361
  • 3
  • 15
  • 1
    Search Google on "python-docx iter_block_items" to find some resources on the last one about tables in document-order. – scanny Nov 08 '19 at 16:01
  • 1
    Since you are trying to convert the various entities in the document like images and all into an xml. Try reading the word document as xml only and then convert that xml into whatever format of xml you want. Maybe this https://www.toptal.com/xml/an-informal-introduction-to-docx will help. Try reading the docx xml and then manipulate it according to your need – Shashank Shekhar Shukla Nov 15 '19 at 10:24

0 Answers0