How to create corpus from multiple docx files in Python

Question

I have a folder containing 10 docx files. I am trying to create a corpus, which should be a list of length 10 where each element is the full text of one docx document.

I have the following function to extract text from docx files:

    import os
    from nltk.corpus.reader.plaintext import PlaintextCorpusReader
    import glob
    from docx import Document

    def getText(filename):
        document = Document(filename)

        newparatextlist = []
        for paragraph in document.paragraphs:
            newparatextlist.append(paragraph.text.strip().encode("utf-8"))
        return newparatextlist

    path = 'path_to_folder/*.docx'
    files = glob.glob(path)

    corpus_list = []
    for f in files:
        cur_corpus = getText(f)
        corpus_list.append(cur_corpus)

    corpus_list[0]

However, if my Word documents contain content like the following: http://www.actus-usa.com/sampleresume.doc https://www.myinterfase.com/sjfc/resources/resource_view.aspx?resource_id=53

the function above creates a list of lists (one inner list of paragraphs per file). How can I simply create a corpus out of the files?

TIA!

Without seeing an example of your files we can't be sure. What do you mean by corpus, a list of the text from the 10 documents? — parsethis, Feb 23 '17 at 05:01
Use `extend` rather than `append` when adding text to newparatextlist. — cco, Feb 23 '17 at 07:40
Duplicate of http://stackoverflow.com/questions/24104908/how-to-create-docx-files-with-python ? Did the answer help in the linked question? — alvas, Feb 23 '17 at 08:18
@putonspectacles, example file is as attached in the link above: http://www.actus-usa.com/sampleresume.doc. You are right, I am looking to create a list of text from 10 documents. Each element in that list should be text from each document. With the method I used above, I get a list of 10 lists. When I try to flatten it out, I get one list where each element is a line from the file and not the whole text from the file. — Craig Bing, Feb 23 '17 at 13:42
@cco: Thanks for the suggestion. I tried that but based on the sample file attached, I get a list where every element refers to each character in the file. — Craig Bing, Feb 23 '17 at 13:47
@alvas: How can this be a duplicate? I read that question before posting mine. I do not want to create a docx file. I already have docx files and I am trying to create a corpus out of it. I am going the other way around. — Craig Bing, Feb 23 '17 at 13:48
My mistake - the change is to make `corpus_list.append()` into `corpus_list.extend()`. Since `getText()` returns a list, appending it to `corpus_list` gets you a list of lists, while extending it adds each of the elements of the list returned by `getText()` to `corpus_list`. — cco, Feb 23 '17 at 22:36
@CraigBing, sorry but possibly a duplicate of http://stackoverflow.com/questions/25228106/how-to-extract-text-from-an-existing-docx-file-using-python-docx? — alvas, Feb 24 '17 at 00:56
I don't think it's a question about NLTK as much as a matter of reading `.docx` files through python. And once you've a Pythonic `str` or `File` object from the `.docx` file, it should work. The `PlaintextCorpusReader` should read plain text files (https://en.wikipedia.org/wiki/Plain_text) and `.docx` isn't plain text. — alvas, Feb 24 '17 at 00:57
Try using PlaintextCorpusReader as `corpus = PlaintextCorpusReader("./news", ".*\.txt")`, taken from https://pynlp.wordpress.com/2013/12/10/unit-5-part-ii-working-with-files-ii-the-plain-text-corpus-reader-of-nltk/. I didn't test this example. — Ricardo Rivaldo, Mar 02 '18 at 19:32
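
The outcomes described in the comments above (a list of lists, a list of lines, a list of characters) can be reproduced without any docx files. The sketch below uses two made-up documents, each represented as a list of paragraph strings standing in for what `getText()` returns; the sample text and variable names are illustrative assumptions, not taken from the question's files.

```python
# Two hypothetical documents, each already split into paragraphs,
# standing in for the lists that getText() returns per file.
doc_a = ["John Doe", "Experienced engineer"]
doc_b = ["Jane Roe", "Data analyst"]
documents = [doc_a, doc_b]

# append: one inner list per document, i.e. a list of lists
appended = []
for paragraphs in documents:
    appended.append(paragraphs)

# extend with a list: one element per paragraph, document boundaries are lost
extended = []
for paragraphs in documents:
    extended.extend(paragraphs)

# extend with a string: a string is iterable, so each character is added
chars = []
chars.extend("John")

# join, then append: one string per document
joined = []
for paragraphs in documents:
    joined.append(" ".join(paragraphs))

print(appended)  # [['John Doe', 'Experienced engineer'], ['Jane Roe', 'Data analyst']]
print(extended)  # ['John Doe', 'Experienced engineer', 'Jane Roe', 'Data analyst']
print(chars)     # ['J', 'o', 'h', 'n']
print(joined)    # ['John Doe Experienced engineer', 'Jane Roe Data analyst']
```

Only the last variant gives a list whose length equals the number of documents with one string per document, which is the shape the question asks for.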

score 1 · Answer 1 · answered Sep 03 '19 at 10:37

I hit the same problem, loading several docx files into a corpus, and solved it with a slightly different approach. I made some small changes to your code so that `getText()` returns one string per document:

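    import glob

    from docx import Document

    def getText(filename):
        # Join all paragraphs so each document becomes a single string
        doc = Document(filename)
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text.strip("\n"))
        return " ".join(fullText)

    PATH = "path_to_folder/*.docx"
    files = glob.glob(PATH)

    # One string per file, so corpus_list is a flat list of document texts
    corpus_list = []
    for f in files:
        cur_corpus = getText(f)
        corpus_list.append(cur_corpus)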

Hopefully this solves the problem!
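
A possible variation on the same idea, offered as a sketch rather than a definitive implementation: it skips empty paragraphs and sorts the file list, since `glob` does not guarantee any particular order. The names `join_paragraphs`, `read_docx_paragraphs` and `build_corpus` are hypothetical helpers introduced here; the only python-docx calls used are `Document(filename)` and `.paragraphs`, the same ones as in the code above.

```python
import glob
import os


def join_paragraphs(paragraphs):
    """Collapse one document's paragraph strings into one string, skipping blanks."""
    cleaned = (p.strip() for p in paragraphs)
    return " ".join(p for p in cleaned if p)


def read_docx_paragraphs(filename):
    """Return the paragraph texts of a docx file (requires python-docx)."""
    from docx import Document  # imported lazily so the helpers above work without it
    return [p.text for p in Document(filename).paragraphs]


def build_corpus(folder, read_paragraphs=read_docx_paragraphs):
    """Return a list with one string per .docx file in folder, in sorted file order."""
    files = sorted(glob.glob(os.path.join(folder, "*.docx")))
    return [join_paragraphs(read_paragraphs(f)) for f in files]
```

Under these assumptions, `corpus_list = build_corpus("path_to_folder")` would give a list whose length equals the number of docx files in the folder. Passing the reader as a parameter is only a convenience for trying the joining logic without real documents.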

score 0 · Answer 2 · answered Apr 19 '20 at 12:52

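    from nltk.corpus.reader.plaintext import PlaintextCorpusReader

    # The second argument is a regular expression over file names, not a glob
    corpus = PlaintextCorpusReader(ROOT_PATH, r'.*\.docx')

This selects every docx file present in ROOT_PATH. Note that `PlaintextCorpusReader` reads files as plain text, and `.docx` is a zipped XML format, so the text has to be extracted first (for example with python-docx) before the reader returns usable content.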
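
As alvas's comment points out, `PlaintextCorpusReader` expects plain text files. One way to still use it, sketched here under stated assumptions, is to write the text already extracted from each docx file to a `.txt` file and point the reader at that folder. The helper names `write_plaintext_corpus` and `load_plaintext_corpus`, and the `texts_by_name` mapping from file name to extracted text, are hypothetical.

```python
import os


def write_plaintext_corpus(texts_by_name, out_dir):
    """Write each document's text to out_dir/<name>.txt and return the paths written."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for name, text in sorted(texts_by_name.items()):
        # Reuse the original file name, swapping its extension for .txt
        stem = os.path.splitext(os.path.basename(name))[0]
        path = os.path.join(out_dir, stem + ".txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
        written.append(path)
    return written


def load_plaintext_corpus(out_dir):
    """Open the exported folder with NLTK (requires nltk to be installed)."""
    from nltk.corpus.reader.plaintext import PlaintextCorpusReader
    # The second argument is a regular expression matched against file names
    return PlaintextCorpusReader(out_dir, r".*\.txt")
```

With the files exported, `load_plaintext_corpus(out_dir).fileids()` should list the `.txt` files and `.raw(fileid)` should return the text of one of them.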

pratap
