I'm using the docx package to extract elements from a word document and want to save them in a specific XML format.
doc = docx.Document("sample.docx")
paras = doc.paragraphs
sample.docx
contain headings, standard text, images, hyperlinks, lists and tables.
When I print out the different styles in the document, it seems like I can easily extract the headings and standard text components. i.e. The following giving me styles such as Heading, Normal, Body Text, Heading 2, Spacer, List paragraph
etc.
for p in paras:
print(p.style.name)
Can someone shed some light on how I can extract the following components?
- Images: How to extract images? I found a similar answer here.
- Hyperlinks: How to know a paragraph has a link in it?
- Lists: Some lists are extracted as
List paragraph
style while other lists have not been extracted - Tables: I found that for tables, need to extract
doc.tables
. But then how do I maintain order of the elements in the original doc?