'''This script copies text from documents (docx) to simple text files.
'''
import sys
import ntpath
import os
from docx import Document
docpath = os.path.abspath(r'C:\Users\Khairul Basar\Documents\CWD Projects\00_WORKING\WL_SLOT1_submission_date_30-03-2018\1-100')
txtpath = os.path.abspath(r'C:\Users\Khairul Basar\Documents\CWD Projects\00_WORKING\WL_SLOT1_submission_date_30-03-2018\Textfiles')
for filename in os.listdir(docpath):
    try:
        document = Document(os.path.join(docpath, filename))
        # print(document.paragraphs)
        print(filename)
        savetxt = os.path.join(txtpath, ntpath.basename(filename).split('.')[0] + ".txt")
        print('Reading ' + filename)
        # print(savetxt)
        fullText = []
        for para in document.paragraphs:
            # print(para.text)
            fullText.append(para.text)
        with open(savetxt, 'wt') as newfile:
            for item in fullText:
                newfile.write("%s\n" % item)
        # with open(savetxt, 'a') as f:
        #     f.write(para.text)
        #     print(" ".join([line.rstrip('\n') for line in f]))
        # newfile.write(fullText)
        # newfile.save()
        # newfile.save()
        #
        # newfile.write('\n\n'.join(fullText))
        # newfile.close()
    except:
        # print(filename)
        # document = Document(os.path.join(docpath, filename))
        # print(document.paragraphs)
        print('Please fix an error')
        exit()
# print("Please supply an input and output file. For example:\n"
# # " example-extracttext.py 'My Office 2007 document.docx' 'outp"
# "utfile.txt'")
# Fetch all the text out of the document we just created
# Make explicit unicode version
# Print out text of document with two newlines under each paragraph
print(savetxt)
The Python 3 script above reads .docx files and creates .txt files. One directory holds hundreds of .docx files, but the script creates only 19 .txt files and then exits. I couldn't figure out why.
The .docx files are output from OCR software; they all contain plain English text (no images, tables, graphs, or anything special).
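The bare `except:` in the script prints the same message for every failure, which hides the reason the loop stopped. A minimal sketch of an alternative that reports the file name and the real exception; `convert_all` and `fake_convert` are hypothetical names, and the failure here is simulated rather than taken from the real files:

```python
import traceback


def convert_all(filenames, convert):
    """Run convert on each name; report failures instead of hiding them."""
    failed = []
    for name in filenames:
        try:
            convert(name)
        except Exception:
            # Show which file failed and the full reason, then keep going.
            print("Failed on " + name)
            traceback.print_exc()
            failed.append(name)
    return failed


def fake_convert(name):
    # Stand-in for the real docx-to-txt step; fails on one made-up name.
    if name == "bad.docx":
        raise ValueError("simulated failure")


failed = convert_all(["1.docx", "bad.docx", "2.docx"], fake_convert)
print(failed)
```

Catching `Exception` rather than using a bare `except:` also lets Ctrl+C still interrupt the loop.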
Today I ran the program again after removing the try/except block, and the result is the same:
1.docx
Reading 1.docx
10.docx
Reading 10.docx
100.docx
Reading 100.docx
11.docx
Reading 11.docx
12.docx
Reading 12.docx
13.docx
Reading 13.docx
14.docx
Reading 14.docx
15.docx
Reading 15.docx
16.docx
Reading 16.docx
17.docx
Reading 17.docx
18.docx
Reading 18.docx
Traceback (most recent call last):
  File "C:\Users\Khairul Basar\Documents\CWD Projects\docx2txtv2.py", line 26, in <module>
    newfile.write("%s\n" % item)
  File "C:\Python36\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0113' in position 77: character maps to <undefined>
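The last line of the traceback can be reproduced in isolation. A minimal sketch, assuming the default text encoding on this machine is cp1252 (as the `cp1252.py` frame in the traceback suggests); `can_encode` is a hypothetical helper and the sample string is made up:

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


sample = "m\u0113"  # contains U+0113, the character named in the traceback

print(can_encode(sample, "cp1252"))  # cp1252 has no mapping for U+0113
print(can_encode(sample, "utf-8"))   # UTF-8 can represent any Unicode character
```

When `open()` is called in text mode without an `encoding` argument, Python uses the locale's preferred encoding, so the write fails only for files that contain a character outside that encoding.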
Another post here resolves this with .encode("utf-8"), but if I use it I get b' my text' on every line, which I don't want.
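The `b'...'` prefix appears because `%s` formats a `bytes` object using its repr. A small sketch of the effect, with a made-up sample string:

```python
item = "caf\u0113"  # made-up sample text containing U+0113

# Formatting bytes with %s inserts the repr of the bytes object, prefix and all.
as_bytes_line = "%s\n" % item.encode("utf-8")

# Formatting the str itself keeps the text unchanged.
as_text_line = "%s\n" % item

# ascii() keeps the output printable on any console encoding.
print(ascii(as_bytes_line))
print(ascii(as_text_line))
```

So the encoding step belongs on the file object, through the `encoding` argument of `open`, and not on each string before it is formatted.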
UPDATE: fixed
I changed the following line by adding encoding='utf-8':
with open(savetxt, 'w', encoding='utf-8') as newfile:
I took the help from this post.
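A sketch of the fixed write step on its own, so it can be tried without any .docx input. `write_paragraphs` is a hypothetical helper name and the sample text is made up; only the `encoding='utf-8'` argument comes from the fix described here:

```python
import os
import tempfile


def write_paragraphs(path, paragraphs):
    """Write each paragraph on its own line, always encoding as UTF-8."""
    with open(path, "w", encoding="utf-8") as newfile:
        for item in paragraphs:
            newfile.write("%s\n" % item)


# Made-up sample containing U+0113, which cp1252 cannot encode.
paragraphs = ["first line", "m\u0113 second line"]

with tempfile.TemporaryDirectory() as tmpdir:
    target = os.path.join(tmpdir, "sample.txt")
    write_paragraphs(target, paragraphs)
    # Read back with the same encoding to confirm the round trip.
    with open(target, encoding="utf-8") as f:
        round_trip = f.read().splitlines()

print(round_trip == paragraphs)
```

Anything that reads these files later needs the same `encoding='utf-8'`, otherwise the reader falls back to the locale default again.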
Thank you to whoever formatted my post so nicely.