Listing filenames in a directory with .docx extension using Python

Question

'''This script is to copy text from documents (docx) to simple text file

'''

import sys
import ntpath
import os
from docx import Document

docpath = os.path.abspath(r'C:\Users\Khairul Basar\Documents\CWD Projects\00_WORKING\WL_SLOT1_submission_date_30-03-2018\1-100')
txtpath = os.path.abspath(r'C:\Users\Khairul Basar\Documents\CWD Projects\00_WORKING\WL_SLOT1_submission_date_30-03-2018\Textfiles')

for filename in os.listdir(docpath):
    try:
        document = Document(os.path.join(docpath, filename))
        # print(document.paragraphs)
        print(filename)
        savetxt = os.path.join(txtpath, ntpath.basename(filename).split('.')[0] + ".txt")
        print('Reading ' + filename)
        # print(savetxt)
        fullText = []
        for para in document.paragraphs:
            # print(para.text)
            fullText.append(para.text)
        with open(savetxt, 'wt') as newfile:
            for item in fullText:
                newfile.write("%s\n" % item)
        # with open(savetxt, 'a') as f:
        # f.write(para.text)
        # print(" ".join([line.rstrip('\n') for line in f]))
        # newfile.write(fullText)
        # newfile.save()
        # newfile.save()
        #
        # newfile.write('\n\n'.join(fullText))
        # newfile.close()

    except:
        # print(filename)
        # document = Document(os.path.join(docpath, filename))
        # print(document.paragraphs)
        print('Please fix an error')
        exit()

    # print("Please supply an input and output file. For example:\n"
    #       "  example-extracttext.py 'My Office 2007 document.docx' 'outp"
    #       "utfile.txt'")

    # Fetch all the text out of the document we just created

    # Make explicit unicode version

    # Print out text of document with two newlines under each paragraph

print(savetxt)

The Python 3 script above reads .docx files and writes their text to .txt files. One directory holds hundreds of .docx files, but the script creates only 19 .txt files and then exits. I couldn't figure out why.

The .docx files are output from OCR software. All of them contain English text only (no images, tables, graphs, or anything special).

Today I ran the program again after removing the try/except block, and the result is the same:

1.docx
Reading 1.docx
10.docx
Reading 10.docx
100.docx
Reading 100.docx
11.docx
Reading 11.docx
12.docx
Reading 12.docx
13.docx
Reading 13.docx
14.docx
Reading 14.docx
15.docx
Reading 15.docx
16.docx
Reading 16.docx
17.docx
Reading 17.docx
18.docx
Reading 18.docx
Traceback (most recent call last):
  File "C:\Users\Khairul Basar\Documents\CWD Projects\docx2txtv2.py", line 26, in <module>
    newfile.write("%s\n" % item)
  File "C:\Python36\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0113' in position 77: character maps to <undefined>

Another post here resolves this with .encode("utf-8"), but if I use it I get b' my text' on every line, which I don't want.
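
Both symptoms can be reproduced with the standard library alone. This is a minimal sketch: the sample string is an assumption, chosen only because it contains '\u0113', the character named in the traceback. The real file contents are not shown in the question.

```python
# Hypothetical sample line containing U+0113, the character from the traceback.
line = "caf\u0113"

# cp1252 (the codec named in the traceback) has no mapping for U+0113,
# so encoding with it raises UnicodeEncodeError.
try:
    line.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# .encode() returns a bytes object. Formatting bytes with %s inserts its
# repr, which is where the unwanted b'...' wrapper comes from.
as_bytes = line.encode("utf-8")
formatted = "%s\n" % as_bytes

print(cp1252_ok)
print(repr(formatted))
```

So encoding each string by hand only moves the problem. The text has to stay a str, and the file object has to be told which encoding to use.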

UPDATE: fixed

I changed the following line by adding encoding='utf-8':

with open(savetxt, 'w', encoding='utf-8') as newfile:
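
A minimal sketch of that fix in isolation, assuming two made-up paragraphs and a temporary output folder (neither comes from the original script). It writes a line containing '\u0113' and reads it back unchanged.

```python
import os
import tempfile

# Hypothetical sample paragraphs; the second contains U+0113.
paragraphs = ["plain ASCII text", "text with \u0113 in it"]

# Temporary folder standing in for the real txtpath.
out_dir = tempfile.mkdtemp()
savetxt = os.path.join(out_dir, "sample.txt")

# An explicit encoding makes the result independent of the platform default
# (cp1252 in the traceback above).
with open(savetxt, "w", encoding="utf-8") as newfile:
    for item in paragraphs:
        newfile.write("%s\n" % item)

# Read the file back with the same encoding to confirm nothing was lost.
with open(savetxt, encoding="utf-8") as f:
    round_trip = f.read().splitlines()

print(round_trip == paragraphs)
```

Any program that later reads these files also needs to open them as UTF-8.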

I took help from this post.

Thank you to whoever formatted my post so nicely.

I am sorry above text is messed up. Here is python code file:https://drive.google.com/open?id=19QISE7aSS4m7lczr5Dr6NcKF9KEqU18f — k.b, Mar 03 '18 at 18:22
In your previous questions you had no problem formatting code, but just in case you forgot (and also managed to ignore the various Help and Format and Preview options around the editor): [How to format your post using Markdown and HTML](https://stackoverflow.com/help/formatting). Apart from that, there is not enough information to suggest an answer. Is there anything special about the files, other than those 19 that did work? File type? Name? Size? Contents? — Jongware, Mar 04 '18 at 10:22
(cont'd): **required first step**: remove the `try/except` so you can actually *see* why a file fails... — Jongware, Mar 04 '18 at 10:24
I made a change as suggested by usr2564301 but I get the same result. — k.b, Mar 04 '18 at 14:31
Yeah of course you do. But that was not suggested as a "fix"! Now you know what the problem is, and where it occurs, instead of "it doesn't work". — Jongware, Mar 04 '18 at 14:32

score 1 · Answer 1 · answered Mar 04 '18 at 14:45

usr2564301 suggested removing the try/except from the code. Doing so showed me the exact error that made the program exit prematurely.

The problem was that my .docx files contain many characters outside the 8-bit character set (cp1252) that Python was using by default to write the text files. Opening the output file with encoding='utf-8' lets those characters be written as they are, because UTF-8 can encode every Unicode character.

That solved the problem.
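
Putting the pieces together, the whole job could look roughly like the sketch below. It is not the original script: the helper names list_docx and docx_to_txt are hypothetical, the filter that skips non-.docx entries and Word's "~$" lock files is an added assumption, and docx_to_txt needs the python-docx package that the question already imports.

```python
from pathlib import Path


def list_docx(folder):
    """Return the .docx files in folder, sorted by name.

    Skips subfolders, other file types, and Word's temporary '~$' lock files.
    """
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file()
        and p.suffix.lower() == ".docx"
        and not p.name.startswith("~$")
    )


def docx_to_txt(docpath, txtpath):
    """Write one UTF-8 .txt file per .docx file and return the paths written."""
    # Imported here so list_docx() can be used without python-docx installed.
    from docx import Document

    out_dir = Path(txtpath)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in list_docx(docpath):
        document = Document(str(src))
        target = out_dir / (src.stem + ".txt")
        # No try/except here: any failure stops with a full traceback.
        with open(target, "w", encoding="utf-8") as newfile:
            for para in document.paragraphs:
                newfile.write(para.text + "\n")
        written.append(target)
    return written
```

Calling docx_to_txt(docpath, txtpath) with the two folders from the question would then convert every .docx file in the first folder. Note that src.stem keeps everything before the last dot, whereas split('.')[0] in the original keeps only the part before the first dot.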

Anyway, all credit goes to usr2564301, wherever they are.

k.b


:) Just trying to guide you in the direction of finding a possible solution. Glad you were able to find it out! – Jongware Mar 04 '18 at 15:23
