Opening .doc(x) and .pdf files with Cyrillic names in Python 3.5.2

Question

I am writing a Python script that lists the .doc(x) and .pdf files found in a specified directory tree and returns their total page count:

import os

import win32com.client
from PyPDF2 import PdfFileReader


def allFiles():
    page_count = 0
    counter = 1
    path = pathName()

    # Report file listing every document and its page count
    f = open(path + '\\' + 'Spisak svih fajlova.txt', 'w')
    f.write('Spisak fajlova: ' + '\n')

    file_list = []
    file_path = []

    # Collect every .doc, .docx and .pdf file in the directory tree
    for folderName, subfolders, files in os.walk(path):
        for filename in files:
            if (filename.endswith('.doc') or filename.endswith('.docx') or filename.endswith('.pdf')):
                file_list.append(filename)
                file_path.append(os.path.join(folderName, filename))

    print('\n' + 'Broj fajlova je: %g' % len(file_list) + '\n')
    print(file_list)
    print()
    # print(file_path)

    word = win32com.client.Dispatch('Word.Application')

    for filename in file_path:
        if filename.endswith('.pdf'):
            pdf = PdfFileReader(open(filename, 'rb'))
            num_pages = pdf.getNumPages()
            page_count += num_pages
            f.write('%g. ' % counter + os.path.basename(filename) + ',' + ' %g' % num_pages + ',' + '\n')
            counter += 1
        elif (filename.endswith('.doc') or filename.endswith('.docx')):
            wordfile = word.Documents.Open(filename)
            wordfile.Repaginate()
            num_pages = wordfile.ComputeStatistics(2)  # 2 = wdStatisticPages
            page_count += num_pages
            wordfile.Close()
            f.write('%g. ' % counter + os.path.basename(filename) + ',' + ' %g' % num_pages + ',' + '\n')
            counter += 1

    word.Quit()
    f.write('\n' + 'Ukupan broj stranica je: %g' % page_count)
    f.close()

    print('\n' + 'Broj stranica je: %g' % page_count)
    return page_count

The script does its job beautifully until it has to process a file whose name contains (Serbian) Cyrillic or other non-English characters.

The error I get is:

Traceback (most recent call last):
  File "broj_stranica_2.py", line 165, in <module>
    result()
  File "broj_stranica_2.py", line 160, in result
    allFiles()
  File "broj_stranica_2.py", line 122, in allFiles
    print(file_list)
  File "C:\Anaconda3\lib\encodings\cp852.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 2-10: character maps to <undefined>

To fix this, I tried running the chcp 65001 command in cmd, which was only partially successful: it solved the problem for non-English Latin characters but not for Cyrillic ones. Next, I added # -*- encoding: utf-8 -*- at the start of the script, but to no avail. I then tried adding encoding='utf8' and 'rb' (read binary) to the open statement for the .doc(x) part, which resulted in the same error. Trying filename.decode('utf8') gave me an error saying the string has no decode attribute.
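For context on the last attempt, here is a minimal sketch (not part of the original script) of why calling decode on the file name fails. It assumes the names come from os.walk(), which returns str objects in Python 3; only bytes objects have a decode method, so there is nothing left to decode. The variable names are illustrative.

```python
# A file name from the question; os.walk() hands back str objects like this one.
name = 'љњегфдасд.pdf'

# str has no decode() in Python 3, which is why filename.decode(...) fails.
has_decode = hasattr(name, 'decode')

# decode() exists on bytes: encoding and decoding round-trips the name.
raw = name.encode('utf-8')
round_trip = raw.decode('utf-8')

print(has_decode, round_trip == name)  # prints: False True
```

In other words, the file names are already proper Unicode text by the time the script sees them, so the failure has to come from somewhere that converts them back to bytes.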

The helper function I am using to get the initial path is:

def pathName():
    path = input('Unesi lokaciju fajlova: ')
    return path

I'm using Python 3.5.2 (installed with Anaconda), with PyPDF2 to read the .pdf files and win32com to read the .doc(x) files.

The names of the files I tried processing are 'асдљњеѕџц.docx' and 'љњегфдасд.pdf'.

Possible duplicate of [Python, Unicode, and the Windows console](http://stackoverflow.com/questions/5419/python-unicode-and-the-windows-console) — roeland, Jan 31 '17 at 20:09

score 1 · Accepted Answer · edited May 23 '17 at 11:53

The answer came from Python, Unicode, and the Windows console, the question linked by @roeland, and specifically from the answer there by @J.F. Sebastian. Thanks :).

The issue wasn't really in opening the file but in the print call that prints its name to the console.
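A minimal sketch of what goes wrong, assuming the console codec is cp852 as shown in the traceback: that code page has no Cyrillic letters, so encoding the names for output raises UnicodeEncodeError. The helper console_safe is hypothetical (it is not in the original script) and only illustrates one possible fallback, replacing unencodable characters with escape sequences.

```python
# File names taken from the question; 'cp852' is the codec named in the traceback.
names = ['асдљњеѕџц.docx', 'љњегфдасд.pdf']


def console_safe(text, encoding='cp852'):
    """Return text with characters the codec cannot encode replaced by \\uXXXX escapes."""
    return text.encode(encoding, errors='backslashreplace').decode(encoding)


# Roughly what print() has to do on such a console: encode the text with cp852.
failed = False
try:
    names[0].encode('cp852')
except UnicodeEncodeError:
    failed = True

# The escaped versions contain only characters the console can show.
safe_names = [console_safe(n) for n in names]

print(failed)
print(safe_names)
```

With this fallback the names are shown as escape sequences rather than as readable Cyrillic, so it only keeps the script from crashing; it does not make the console display the actual characters.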

However, if you do need the names printed to the console, what worked for me was the win-unicode-console module. Simply import it and enable it:

import win_unicode_console
win_unicode_console.enable()
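If the same script may also run where the module is not installed or not on Windows, the import can be guarded. This is only a sketch: the function name enable_unicode_console is made up for illustration, and it assumes that silently skipping the step is acceptable in those cases.

```python
import sys


def enable_unicode_console():
    """Try to enable win_unicode_console; return True only if it was enabled."""
    if sys.platform != 'win32':
        return False  # the module targets the Windows console only
    try:
        import win_unicode_console  # third-party package, may not be installed
    except ImportError:
        return False
    win_unicode_console.enable()
    return True


enabled = enable_unicode_console()
```

Calling this once at the top of the script, before the first print, keeps the rest of the code unchanged.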
