Parse Word Document in Python

Question

I wanted to convert a Word document to text, so I used this script:

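import win32com.client

# Start Word through COM automation (Windows only; needs pywin32 and Word installed).
app = win32com.client.Dispatch('Word.Application')
doc = app.Documents.Open(r'C:\Users\SBYSMR10\Desktop\New folder (2)\GENERAL DATA.doc')
content = doc.Content.Text  # the whole document body as one unicode string
doc.Close()
app.Quit()
print(content)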

I get the following result:

[image: screenshot of the output]

Now I want to convert this text into a list that contains all of its items. I used:

content = " ".join(content.replace(u"\xa0", " ").strip().split())

EDIT

When I do that, I get:

[image: screenshot of the output]

It's not a list. What is the problem? What is that big dot character?

What didn't work? What did you get? – Noufal Ibrahim Dec 27 '11 at 08:08
What does "it didn't work" mean? – eumiro Dec 27 '11 at 08:09

score 9 · Accepted Answer · answered Dec 27 '11 at 08:33

Word documents aren't text, they are documents: They have control information (like formatting) and text. If you ignore the control information, the text is pretty useless.

So you have to dig into the details of how to navigate the control structure of the document, find the parts that you're interested in, and then get the text content of those structures.

Note: You'll find that Word is very complex. If you can, consider these two approaches as well:

Save the Word document as HTML from within Word. It will lose some formatting, but lists will stay intact. HTML is much simpler to parse and understand than the Word format.
Save the document as OOXML (the default format since Office 2007; the extension is .docx). This is a ZIP archive with XML documents inside. The XML is again easier to parse and understand than the binary Word format, but harder than the HTML version.
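
As a rough illustration of the second approach, the sketch below reads a .docx file with only the Python 3 standard library. It assumes the main body is stored in `word/document.xml` inside the archive and that paragraphs and text runs use the `w:p` and `w:t` elements of the WordprocessingML namespace. The function name `docx_paragraphs` is made up for this example, and tables, headers, footnotes and list numbering are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path_or_file):
    """Return the text of each paragraph in a .docx file as a list of strings.

    Hypothetical helper: it only looks at word/document.xml and joins the
    text runs (w:t) found inside each paragraph (w:p).
    """
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs
```

Calling `docx_paragraphs("GENERAL DATA.docx")` on a copy of the document saved as .docx would give one string per paragraph, which is already much closer to a list of items than the flat `Content.Text` string.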

Aaron Digulla

If my data is always followed by a known string inside the Word document, how can I fetch it, then? – Shansal Dec 27 '11 at 08:47
Your problem is the list items. Example in HTML: `<ul><li>a</li><li>b</li></ul>`. The text of this is `ab`. How do you know which characters belong to which item if you ignore the document structure? – Aaron Digulla Dec 27 '11 at 08:54
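
To make the point about structure concrete, here is a minimal Python 3 sketch that keeps the list items apart by following the `<li>` tags instead of flattening the text. The class and function names are invented for this example, and it is only meant for flat (non-nested) lists such as the one in the comment above.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collect the text of every <li> element (flat lists only)."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._inside_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._inside_li = True
            self.items.append("")  # start a new item

    def handle_endtag(self, tag):
        if tag == "li":
            self._inside_li = False

    def handle_data(self, data):
        if self._inside_li:
            self.items[-1] += data


def list_items(html):
    """Hypothetical helper: return the stripped text of each <li> in html."""
    collector = ListItemCollector()
    collector.feed(html)
    collector.close()
    return [item.strip() for item in collector.items]


items = list_items("<ul><li>a</li><li>b</li></ul>")
```

Here `items` holds the two entries separately, whereas the plain text of the same markup is just `ab`.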
But in the output of the code, I can only see the big dot character and the words inside the document. So if I can convert this into a list, I believe I can do what I want. – Shansal Dec 27 '11 at 09:01
Isn't there any other way to fetch my data? – Shansal Dec 27 '11 at 09:02
Try to find the character code of the "big dot character" (try `ord()` on the first few characters of the document) and then split the string using this character (use `unichr()` to convert the code into a string). This will work for very simple word documents without nested lists. – Aaron Digulla Dec 27 '11 at 09:16
When I do list(content), I get all the characters in the document, and I see that every big dot character shows up as u'\r', u'\x07'. – Shansal Dec 27 '11 at 09:41
`\r` is a carriage return and `\x07` is a "ring the bell" code. That doesn't seem right... – Aaron Digulla Dec 27 '11 at 10:18
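
Following the suggestion in this thread, a minimal Python 3 sketch of splitting on that character pair could look like the one below. It assumes that `u'\r'` followed by `u'\x07'` acts as the separator between items, as the asker reports (in Word's text output this pair usually appears at the end of a table cell, but that is an assumption about this particular document). The function name and the sample string are made up for illustration.

```python
def split_word_items(content):
    """Split text from doc.Content.Text into a list of items.

    Hypothetical helper: treats the pair '\r\x07' as the item separator,
    replaces non-breaking spaces, collapses runs of whitespace and drops
    empty items.
    """
    content = content.replace("\xa0", " ")
    parts = content.split("\r\x07")
    return [" ".join(part.split()) for part in parts if part.strip()]


# Made-up sample in the shape the asker describes.
sample = "Name\r\x07Value\xa01\r\x07\r\x07"
items = split_word_items(sample)
```

As noted in the comments, this will only work for very simple documents; nested lists or mixed tables and paragraphs need the document structure.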

score 0 · Answer 2 · answered Jan 27 '14 at 16:46

You could just parse the word document line by line. It isn't elegant and it certainly isn't pretty but it works. Here's a snippet from something similar I've done in python 3.3.

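import os

directory = 'your/path/to/file/'
file = 'yourword.doc'
# Open in binary mode: a .doc file is not plain text.
with open(os.path.join(directory, file), 'rb') as doc:
    for line in doc:  # iterating a binary file splits on b'\n'
        line2 = str(line)  # the repr of the bytes, e.g. "b'...'"
        print(line2)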

I used a regular expression to get just what I needed. But this code will read each line of your word document (formatting and all) and convert it to nice strings that you can deal with. Not sure if this is helpful at all (this post is a couple of years old) but at least it parses the word document. Then it's just a matter of getting rid of strings you don't want before writing to a txt file.
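
The regular expression itself is not shown in this answer. One possible filter, sketched here under the assumption that the wanted text sits in the file as runs of printable ASCII bytes, is the following. The function name and the minimum run length of 4 are arbitrary choices for this example, and text that Word stored as 16-bit characters would be missed.

```python
import re

# Runs of at least 4 printable ASCII bytes (space through tilde).
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def printable_runs(data):
    """Hypothetical helper: return printable ASCII runs found in raw bytes."""
    return [match.decode("ascii") for match in PRINTABLE_RUN.findall(data)]


# Made-up bytes standing in for the contents of a binary file.
sample = b"\x00\x01Hello world\x00\x02ab\x00GENERAL DATA\xff"
runs = printable_runs(sample)
```

With a real file, the bytes would come from `open(path, 'rb').read()`, and the resulting strings would still need to be filtered for formatting debris.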

Ryan

Did this really work for you? Looks like a risky way of looking into a word file. Was it just text and no formatting? – Evgeny Oct 05 '17 at 08:34
I believe that I was working with the older .doc format, not .docx (as the question asked). If you are in that format, then the formatting is saved as byte strings, while the text is saved as plain text. I'm not 100% certain that this would work on the newer format, but in principle it should. – Ryan Oct 05 '17 at 20:32

score 0 · Answer 3 · edited Jun 20 '20 at 09:12

Now I want to convert this text into a list that contains all of its items. I used:

content = " ".join(content.replace(u"\xa0", " ").strip().split())

It's not a list. What is the problem?

The .join method always returns a string. It expects an iterable of strings (such as a list) and concatenates the items, using the string it is called on (" " in your case) as the delimiter. The list is what .split() returns before you join it again.
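
A small Python 3 sketch of the difference, using a made-up sample string rather than the real document text:

```python
# Made-up sample standing in for doc.Content.Text.
content = "first\xa0item \r second item "

# split() with no arguments splits on any whitespace and returns a list.
words = content.replace("\xa0", " ").strip().split()

# join() glues the list back together and returns a single string.
joined = " ".join(words)

print(type(words).__name__, words)
print(type(joined).__name__, joined)
```

Dropping the `" ".join(...)` call keeps the list, although it is a list of words, not of the document's items.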

Apart from that, what Aaron Digulla said.

answered Dec 27 '11 at 09:16

Fabian

score 0 · Answer 4 · edited May 23 '17 at 12:08

See this post and its comments: Converting Word documents to text (Python recipe)

This post may also be useful: python convert microsoft office docs to plain text on linux

answered Dec 27 '11 at 09:37

Abdurahman
