
I was looping over some files to copy their contents into a new file, but after I ran the code the new file contained a lot of symbols instead of the text of the files I looped over.

First, when I ran the code without the 'encoding' argument in the open call, it raised an error like: UnicodeEncodeError: 'charmap' codec can't encode character '\x8b' in position 12: character maps to <undefined>.

I tried various encodings such as utf-8 and latin1, but nothing worked. When I added errors='ignore' to the open call, I got the result described above.

import os
import glob

folder = os.path.join('R:', os.sep, 'Files')

def notes():
    for doc in glob.glob(folder + r'\*'):
        if doc.endswith('.pdf'):
            with open(doc,'r') as f:
                x = f.readlines()
            with open('doc1.text', 'w+') as f1:
                for line in x:
                    f1.write(line)

notes()

1 Answer


If I understand your example correctly and you're trying to read PDF files, your problem is not one of encoding but of file format. PDF files don't just store your text in some character encoding; they use their own format, which you need to parse in order to extract the text. There are a couple of Python libraries that can read PDF files (such as PyPDF2). Please refer to this thread for more information: How to extract text from a PDF file?
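As a rough illustration of that approach, here is a minimal sketch that loops over the PDFs in a folder and writes their extracted text to one file. It assumes a recent release of the third-party PyPDF2 package (its successor, pypdf, exposes the same `PdfReader` class). The helper names `find_pdfs`, `pdf_to_text` and `copy_pdf_text` and the output name `doc1.txt` are made up for this example.

```python
import glob
import os


def find_pdfs(folder):
    """Return the paths of the .pdf files directly inside `folder`, sorted."""
    return sorted(glob.glob(os.path.join(folder, '*.pdf')))


def pdf_to_text(path):
    """Extract the text of every page of one PDF (needs PyPDF2 installed)."""
    # Imported here so the rest of the module loads without the package.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    # extract_text() may return nothing for image-only pages, hence `or ''`.
    return '\n'.join(page.extract_text() or '' for page in reader.pages)


def copy_pdf_text(folder, out_path='doc1.txt'):
    """Write the extracted text of all PDFs in `folder` to `out_path`."""
    # Open the output once, outside the loop, so earlier files are not
    # overwritten by later ones.
    with open(out_path, 'w', encoding='utf-8') as out:
        for path in find_pdfs(folder):
            out.write(pdf_to_text(path))
            out.write('\n')
```

Calling `copy_pdf_text(folder)` with the `folder` from the question would replace the `notes()` function. Scanned PDFs that contain only images yield no text this way and would need OCR instead.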

Rom
  • I tried with .docx files too, but the result is the same –  Aug 12 '19 at 06:31
  • @riki PDFs and DOCX files are ***binary files***; pass the `'b'` mode flag to `open` – gboffi Aug 12 '19 at 07:10
  • Again, docx files and pdf files have special formats which you need to parse. You need to use a library for that, like pydocx and PyPDF2 – Rom Aug 13 '19 at 15:11
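
To make the point from the comments concrete, here is a small standard-library-only sketch. It opens a file in binary mode to check its signature bytes, and it pulls the paragraph text out of a .docx file, which is a ZIP archive containing XML. The function names `sniff_format` and `docx_to_text` are invented for this example, and the extraction is deliberately naive: it reads only `word/document.xml` and ignores headers, footnotes and formatting. A dedicated library is the better choice for real documents.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def sniff_format(path):
    """Guess the container format from the first bytes of the file."""
    with open(path, 'rb') as f:  # 'rb' reads raw bytes, no text decoding
        head = f.read(5)
    if head.startswith(b'%PDF-'):
        return 'pdf'
    if head.startswith(b'PK\x03\x04'):
        return 'zip'  # .docx files are ZIP archives
    return 'unknown'


def docx_to_text(path):
    """Return the text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))
    paragraphs = []
    for para in root.iter(W_NS + 'p'):
        # A paragraph's text is split across one or more <w:t> elements.
        runs = [node.text or '' for node in para.iter(W_NS + 't')]
        paragraphs.append(''.join(runs))
    return '\n'.join(paragraphs)
```

Opening the file with `'rb'` avoids the codec error, but the bytes still have to be parsed as shown before any readable text comes out.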