I'm preprocessing/preparing a batch of MS Word documents, which I automatically converted from .doc to .docx, to use them later to train an NLP model with entity recognition.
I'm a newbie in Python programming as well as in spaCy NLP, but I have some programming experience in other languages. Right now, my biggest question that makes me feel like "I don't know what to do or how to do it" is this:
I have the documents in a folder. I need to parse the raw text and titles (which are in the name of the document itself, not the first line in the document) to make a corpus which is going to be used later on to train the NLP model.
Since I'm a newbie I have a lot to learn, so I've already done a lot of research on this topic. In the beginning it was too much work for me to convert all these .doc files to .docx files, but I've finally found a way to do that more conveniently.
Since I need to get the title and the text from a bunch of documents, I assumed that I needed to walk over the documents in the folder using a for-loop, which I did like this:
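import os

folder = '/path/to/folder'
for filename in os.listdir(folder):
    # Only process Word documents
    if filename.endswith('.docx'):
        # Build the full path without overwriting the folder variable
        file_path = os.path.join(folder, filename)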
I've also tried what I found in this question (using the native python-docx module).
But this produced the error:
TypeError: sequence item 0: expected str instance, bytes found
Edit:
The TypeError problem is solved. I tried three different ways to extract text from a Word document, and this one gave me the best output (without errors):
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

print(getText('test.docx'))
So now I (finally) know how to extract the text properly from a Word document. I still need to figure out how to do this for a whole folder, and what my next steps should be in order to build a corpus that will be used for NLP.
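To make the goal concrete, here is a rough sketch of what I think the folder step could look like: loop over the .docx files, take the title from the file name, and extract the text with python-docx. The names `get_text` and `build_corpus` and the list-of-dicts layout are my own assumptions for illustration, not something spaCy requires, and I'm not sure this is the right structure for training later.

```python
import os


def get_text(filename):
    """Extract the paragraph text of one .docx file with python-docx."""
    import docx  # python-docx; imported here so the loop below can be tested without it
    doc = docx.Document(filename)
    return '\n'.join(para.text for para in doc.paragraphs)


def build_corpus(folder, extract=get_text):
    """Return a list of {'title', 'text'} dicts, one per .docx file in folder.

    `extract` is a hypothetical hook: any function that maps a file path to text.
    """
    corpus = []
    for filename in sorted(os.listdir(folder)):
        # Skip non-Word files and Word's temporary lock files (~$name.docx)
        if not filename.endswith('.docx') or filename.startswith('~$'):
            continue
        # The title is the file name without its extension
        title = os.path.splitext(filename)[0]
        text = extract(os.path.join(folder, filename))
        corpus.append({'title': title, 'text': text})
    return corpus
```

I would call it as `corpus = build_corpus('/path/to/folder')` and then, presumably, feed the `text` values to spaCy (for example through `nlp.pipe`), keeping the titles alongside as metadata.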
By the way, I'm using PyCharm in an Ubuntu 18.04 virtual machine with Python 3.6.
I've also explained my problem a bit in a different way in this post (see comment #9).
I posted that yesterday, before trying out what I found in the Stack Overflow question I mentioned earlier.
Could anyone give me an idea of a good way to extract titles from MS Word documents in order to build a corpus of files to use in spaCy?
Thank you very much for your time.