I have a folder that contains 10 docx files. I am trying to create a corpus, which should be a list of length 10. Each element of the list should hold the text of one docx document.
I have the following function to extract text from docx files:
import os
import glob

from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from docx import Document


def getText(filename):
    # Collect the stripped text of every paragraph in the document.
    document = Document(filename)
    newparatextlist = []
    for paragraph in document.paragraphs:
        newparatextlist.append(paragraph.text.strip().encode("utf-8"))
    return newparatextlist


path = 'path_to_folder/*.docx'
files = glob.glob(path)
corpus_list = []
for f in files:
    # getText returns a list of paragraphs, so this appends a list per file.
    cur_corpus = getText(f)
    corpus_list.append(cur_corpus)
corpus_list[0]
However, if my Word documents have content like these samples:
http://www.actus-usa.com/sampleresume.doc
https://www.myinterfase.com/sjfc/resources/resource_view.aspx?resource_id=53
the function above creates a list of lists (one inner list of paragraphs per document). How can I simply create a corpus out of the files, so that each element is the full text of one document?
TIA!
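One possible direction, as a rough sketch rather than a definitive fix: join the paragraphs of each document into a single string before appending, so the corpus becomes a flat list with one element per file. The helper names `join_paragraphs`, `get_text`, and `build_corpus` are hypothetical, the folder path is a placeholder, and the sketch assumes Python 3 with python-docx installed (so the text is left as `str` instead of being encoded to bytes).

```python
import glob


def join_paragraphs(paragraphs):
    """Collapse paragraph strings into one text string, skipping blank ones."""
    cleaned = (p.strip() for p in paragraphs)
    return "\n".join(p for p in cleaned if p)


def get_text(filename):
    """Return the whole text of one .docx file as a single string."""
    # Imported lazily so join_paragraphs works without python-docx installed.
    from docx import Document

    document = Document(filename)
    return join_paragraphs(p.text for p in document.paragraphs)


def build_corpus(pattern):
    """Return a list with one text string per file matching the glob pattern."""
    # sorted() only makes the file order reproducible between runs.
    return [get_text(f) for f in sorted(glob.glob(pattern))]


if __name__ == "__main__":
    # Placeholder path: replace with the real folder.
    corpus_list = build_corpus("path_to_folder/*.docx")
    print(len(corpus_list))
```

With 10 files in the folder, `corpus_list` would then have length 10 and `corpus_list[0]` would be a plain string. Note that `document.paragraphs` covers body paragraphs only, so text inside tables would need separate handling.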