I am writing a Python script that lists the .doc(x) and .pdf files found in a specified directory tree and returns the total number of their pages:
import os

import win32com.client
from PyPDF2 import PdfFileReader


def allFiles():
    page_count = 0
    counter = 1
    path = pathName()
    # Output file name: "List of all files.txt"
    f = open(path + '\\' + 'Spisak svih fajlova.txt', 'w')
    f.write('Spisak fajlova: ' + '\n')  # "List of files:"
    file_list = []
    file_path = []
    for folderName, subfolders, files in os.walk(path):
        for filename in files:
            if (filename.endswith('.doc') or filename.endswith('.docx') or filename.endswith('.pdf')):
                file_list.append(filename)
                file_path.append(os.path.join(folderName, filename))
    print('\n' + 'Broj fajlova je: %g' % len(file_list) + '\n')  # "Number of files is:"
    print(file_list)
    print()
    # print(file_path)
    word = win32com.client.Dispatch('Word.Application')
    for filename in file_path:
        if filename.endswith('.pdf'):
            pdf = PdfFileReader(open(filename, 'rb'))
            num_pages = pdf.getNumPages()
            page_count += num_pages
            f.write('%g. ' % counter + os.path.basename(filename) + ',' + ' %g' % num_pages + ',' + '\n')
            counter += 1
        elif (filename.endswith('.doc') or filename.endswith('.docx')):
            wordfile = word.Documents.Open(filename)
            wordfile.Repaginate()
            num_pages = wordfile.ComputeStatistics(2)  # 2 = wdStatisticPages
            page_count += num_pages
            wordfile.Close()
            f.write('%g. ' % counter + os.path.basename(filename) + ',' + ' %g' % num_pages + ',' + '\n')
            counter += 1
    word.Quit()
    f.write('\n' + 'Ukupan broj stranica je: %g' % page_count)  # "Total number of pages is:"
    f.close()
    print('\n' + 'Broj stranica je: %g' % page_count)  # "Number of pages is:"
    return page_count
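The file-collection step can be looked at in isolation from the page counting. Below is a minimal sketch of it, assuming a hypothetical helper name `collect_documents` that is not part of my script. It uses only `os.walk` and the fact that `str.endswith` accepts a tuple of suffixes. Unlike the code above, it lower-cases the name first, so `.PDF` and `.DOCX` would also match; that is an assumption about the desired behavior.

```python
import os

# Extensions of interest, in lower case for case-insensitive matching.
DOC_EXTENSIONS = ('.doc', '.docx', '.pdf')


def collect_documents(root):
    """Return full paths of .doc, .docx and .pdf files under root (hypothetical helper)."""
    found = []
    for folder_name, _subfolders, files in os.walk(root):
        for filename in files:
            # str.endswith accepts a tuple, so one call covers all extensions.
            if filename.lower().endswith(DOC_EXTENSIONS):
                found.append(os.path.join(folder_name, filename))
    return found
```

The returned paths are ordinary `str` objects, so Cyrillic names are carried through unchanged at this stage.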
The script does its job beautifully until I try to have it process a file whose name contains (Serbian) Cyrillic or other non-English characters.
The error I get is:
Traceback (most recent call last):
  File "broj_stranica_2.py", line 165, in <module>
    result()
  File "broj_stranica_2.py", line 160, in result
    allFiles()
  File "broj_stranica_2.py", line 122, in allFiles
    print(file_list)
  File "C:\Anaconda3\lib\encodings\cp852.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 2-10: character maps to <undefined>
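The traceback points at `print(file_list)` and the cp852 codec, so the failure seems to happen when the text is encoded for the console, not when the files are read. Below is a minimal sketch that reproduces the encoding failure without any files, assuming the console encoding is cp852 as the traceback suggests. The `safe_print` helper is hypothetical: it shows one possible way to print without raising, by escaping characters the target encoding cannot represent.

```python
import sys

name = 'асдљњеѕџц.docx'  # one of the file names that triggers the error

# cp852 contains no Cyrillic letters, so encoding the name fails,
# just like the print call in the traceback.
try:
    name.encode('cp852')
    failed = False
except UnicodeEncodeError:
    failed = True


def safe_print(value, stream=None):
    """Print value, escaping unencodable characters (hypothetical helper)."""
    if stream is None:
        stream = sys.stdout
    # Fall back to UTF-8 if the stream does not report an encoding.
    encoding = getattr(stream, 'encoding', None) or 'utf-8'
    text = str(value)
    # backslashreplace emits \uXXXX escapes instead of raising an error.
    printable = text.encode(encoding, errors='backslashreplace').decode(encoding)
    stream.write(printable + '\n')
```

Calling `safe_print(file_list)` in place of `print(file_list)` would show escape sequences on a cp852 console and the real letters on a console that can encode them.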
To fix this, I tried running chcp 65001 in cmd before starting the script. This was partially successful: it solved the issue for non-English Latin characters, but not for Cyrillic ones.
Next, I added # -*- encoding: utf-8 -*- at the start of the script, but to no avail.
I then tried adding encoding='utf8' and 'rb' (read binary) to the open statement for the .doc(x) part, which resulted in the same error. Trying filename.decode(utf8) gave me an error saying that a string has no decode attribute.
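The last attempt fails for a reason that is independent of the console: in Python 3 a `str` is already Unicode text, and only `bytes` objects have a `decode` method. Below is a minimal sketch of that distinction, together with writing a Cyrillic name to a text file with an explicit encoding. The choice of UTF-8 for the output file and the demo file name are assumptions, not something taken from my script.

```python
import os
import tempfile

name = 'љњегфдасд.pdf'  # one of the file names that triggers the error

# A Python 3 str has no decode method; it is already decoded text.
has_decode = hasattr(name, 'decode')

# Only bytes can be decoded, so a round trip goes str -> bytes -> str.
round_trip = name.encode('utf-8').decode('utf-8')

# Writing the name to a text file with an explicit encoding
# (assumption: UTF-8 is acceptable for the output file).
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'file_list_demo.txt')  # hypothetical demo file
    with open(out_path, 'w', encoding='utf-8') as f:
        f.write(name + '\n')
    with open(out_path, encoding='utf-8') as f:
        read_back = f.read().strip()
```

Without the `encoding` argument, `open` in text mode uses the locale's preferred encoding, which may not be able to represent Cyrillic letters either.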
The helper function I am using to get the initial path is:
def pathName():
    # Prompt: "Enter the location of the files: "
    path = input('Unesi lokaciju fajlova: ')
    return path
I am using Python 3.5.2 (installed with Anaconda), with PyPDF2 to read the .pdf files and win32com to read the .doc(x) files.
Names of the files I tried processing are 'асдљњеѕџц.docx' and 'љњегфдасд.pdf'.