How can I build a corpus of text files from the text and titles of MS Word documents in Python?

Question

I'm preparing a batch of MS Word documents, which I automatically converted from .doc to .docx, so that I can later use them to train an NLP model with entity recognition.

I'm a newbie in Python programming as well as in spaCy NLP, but I have some programming experience in other languages. Right now, my biggest question that makes me feel like "I don't know what to do or how to do it" is this:

I have the documents in a folder. I need to parse the raw text and the titles (the title is the file name of each document, not the first line inside it) to build a corpus that will later be used to train the NLP model.
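Since the title is just the file name without its extension, a minimal sketch of pulling it out could look like this. The function name and the example path are made up for illustration; only the standard library's pathlib is used.

```python
from pathlib import Path


def title_from_filename(file_path):
    """Return the file name without its directory and extension."""
    return Path(file_path).stem


# Hypothetical example path, not a real file
print(title_from_filename("/path/to/folder/Annual Report 2019.docx"))
```

Note that `stem` only strips the last extension, so a name such as `notes.v2.docx` would give `notes.v2`.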

Since I'm a newbie I have a lot to learn. So I've already done a lot of research on this topic. In the beginning it was too much work for me to convert all these .doc files to .docx files. But I've finally found a way to do that more conveniently.

Since I need to get the title and the text from a bunch of documents, I assumed that I needed to walk over the documents in the folder, using a for-loop, which I did like this:

import os

folder = '/path/to/folder'
for filename in os.listdir(folder):
    if filename.endswith('.docx'):
        # Build the full path without overwriting the folder variable
        file_path = os.path.join(folder, filename)

I've also tried what I found in this question (using the python-docx module).

But this produced the error:

TypeError: sequence item 0: expected str instance, bytes found
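The code that raised this is not shown, so the following is only a guess at the cause: in Python 3 this exact message appears when str.join is given bytes items, for example when each paragraph is encoded with .encode("utf-8") before joining (a pattern common in Python 2 era answers). A small sketch with made-up sample strings:

```python
paragraphs = ["First paragraph", "Second paragraph"]

# Encoding each item turns it into bytes ...
encoded = [p.encode("utf-8") for p in paragraphs]
try:
    "\n".join(encoded)  # ... and str.join only accepts str items
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Keeping the items as str joins without errors
joined = "\n".join(paragraphs)
print(joined)
```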

Edit: The TypeError problem is solved. I tried three different ways to extract text from a Word document again, and this one gave me the best output (without errors):

import docx
def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

print(getText('test.docx'))

So now I (finally) know how to do a good text extraction from a single Word document. I still need to figure out how to do this for a whole folder and what my next steps are in order to build a corpus that can be used for NLP.
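One possible way to apply the same extraction to a whole folder is sketched below. It assumes a flat folder of .docx files and writes one UTF-8 .txt file per document, with the title (taken from the file name) on the first line. The names get_text, build_corpus, and the folder paths are illustrative, not part of python-docx; the one-file-per-document layout is also just an assumption about what the corpus should look like.

```python
from pathlib import Path


def get_text(docx_path):
    """Extract paragraph text with python-docx (third-party, imported lazily)."""
    import docx  # installed with: pip install python-docx

    document = docx.Document(str(docx_path))
    return "\n".join(para.text for para in document.paragraphs)


def build_corpus(source_dir, target_dir, extract=get_text):
    """Write one .txt file per .docx file and return the titles processed."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    titles = []
    for docx_path in sorted(source_dir.glob("*.docx")):
        if docx_path.name.startswith("~$"):
            continue  # skip the temporary lock files Word leaves behind
        title = docx_path.stem  # file name without the .docx extension
        text = extract(docx_path)
        out_path = target_dir / (title + ".txt")
        out_path.write_text(title + "\n\n" + text, encoding="utf-8")
        titles.append(title)
    return titles
```

It could then be called as `build_corpus('/path/to/folder', '/path/to/corpus')`, where the second path is a hypothetical output folder. The extract parameter exists only so that a different extraction function can be swapped in.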

By the way, I'm using PyCharm in an Ubuntu 18.04 virtual machine with Python 3.6.

I've also explained my problem a bit in a different way in this post (see comment #9).

I posted that yesterday, before trying out what I found in the Stack Overflow question I mentioned earlier.

Could anyone suggest a good way to extract the titles from MS Word documents in order to build a corpus of files to use in spaCy?

Thank you very much for taking the time.

Please can you post the code you used to actually try to extract text from the files, and the full traceback for the TypeError that you got? At the moment you say you've tried something but we can't see exactly what. — Tom Dalton, Sep 20 '19 at 10:29
Have you checked this post? https://stackoverflow.com/questions/25228106/how-to-extract-text-from-an-existing-docx-file-using-python-docx — Tiago Duque, Sep 20 '19 at 10:56
@TomDalton So far I have tried 3 different ways to extract text from a Word document. I used the code I found via Tiago Duque's link. I had tried this code a few days ago as well; somehow it didn't work back then, but now it does. I've edited my question and added the code there. I don't get any errors in the output anymore, but I'm wondering how I should do this for a whole batch of documents and save the files to build an NLP corpus. — Jonas, Sep 20 '19 at 13:52
