My Django application works with both .txt and .doc file types. It opens a file, compares it with other files in the database, and prints a report.

The problem is that when the file type is .txt, I get a 'utf-8' codec can't decode byte error (here I'm using encoding='utf-8'). When I switch encoding='utf-8' to encoding='ISO-8859-1', the error changes to 'latin-1' codec can't decode byte.

I want to find an encoding that works with every type of file. This is a small part of my function:

views.py:

@login_required(login_url='sign_in')
def result(request):
    last_uploaded = OriginalDocument.objects.latest('id')
    original = open(str(last_uploaded.document), 'r', encoding='utf-8')
    original_words = original.read().lower().split()
    words_count = len(original_words)
    open_original = open(str(last_uploaded.document), "r")
    read_original = open_original.read()
    report_fives = open("static/report_documents/" + str(last_uploaded.student_name) + 
    "-" + str(last_uploaded.document_title) + "-5.txt", 'w')
    # Path to the documents with which original doc is comparing
    path = 'static/other_documents/doc*.txt'
    files = glob.glob(path)

    rows, found_count, fives_count, rounded_percentage_five, percentage_for_chart_five, fives_for_report, founded_docs_for_report = search_by_five(last_uploaded, 5, original_words, report_fives, files)

    context = {
      ...
    }

    return render(request, 'result.html', context)
colidyre
Bob Reynolds
  • Do you really use `python2.7` as tagged? Then do you use `from codecs import open`? Or do you use python3 with built-in `open()` function? – colidyre Apr 05 '20 at 21:07
  • *I want to find such encoding format that work with every type of file.* -> There is no general encoding which automatically knows how to decode an already encoded file in a specific encoding. UTF-8 is a good option with many compatibilities with other encodings. You can e.g. simply ignore or replace characters which aren't decodable like this: `open(, encoding="utf-8", errors="ignore")` (or `errors="replace"`). Maybe this helps?! – colidyre Apr 05 '20 at 21:10
  • @colidyre I used `errors="replace"` and it worked. Thank you so much. But can you please tell me which of them is safe, `ignore` or `replace`? – Bob Reynolds Apr 05 '20 at 21:35
  • How are you reading the doc file? Are you using any library? The built-in file open does not support doc. – Harsh Nagarkar Apr 05 '20 at 22:12
  • I've copied my comment into an answer with better explaining since it seems what you want. – colidyre Apr 05 '20 at 22:26
  • @BobReynolds Do you mean by "safe" that the program will not raise a decoding error? Then yes, because you're **not** using `errors="strict"` in that case. – colidyre Apr 05 '20 at 22:30
  • @colidyre I asked that question because my application shows the doc file as something else. I thought maybe it's because of the errors keyword, but neither of them fixed that problem. – Bob Reynolds Apr 05 '20 at 23:44
  • Ah okay, I see. I've updated my answer. @BobReynolds – colidyre Apr 06 '20 at 02:03

1 Answer

No single encoding can decode every file. A file has to be decoded with the encoding it was written in, and a plain text file usually does not record which one that was.
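Since the right encoding cannot be known for certain, one common workaround is to try a short list of likely candidates in order. The following is only a sketch: the function name `decode_with_fallback` and the candidate list are assumptions for illustration, not part of the question's code. Note that a decode which succeeds is not necessarily a correct one; Latin-1 in particular maps every possible byte to a character, so it never fails but may produce the wrong letters.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Decode bytes by trying each candidate encoding in order.

    Hypothetical helper: returns a (text, encoding_used) tuple.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
    # Latin-1 assigns a character to every byte value 0x00-0xFF,
    # so this last resort cannot raise UnicodeDecodeError.
    return raw.decode("latin-1"), "latin-1"


text, used = decode_with_fallback(b"caf\xe9")
print(used)
```

To use it on a file, read the file in binary mode (`open(path, "rb").read()`) and pass the resulting bytes to the helper.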

UTF-8 is a good default because it is widely used and compatible with ASCII. You can, for example, simply ignore or replace bytes that cannot be decoded, like this:

from codecs import open
original = open(str(last_uploaded.document), encoding="utf-8", errors="ignore")
original_words = original.read().lower().split()
...
original.close()

Or, better, use a context manager (with statement), which closes the file for you:

with open(str(last_uploaded.document), encoding="utf-8", errors="ignore") as fr:
    original_words = fr.read().lower().split()
    ...

(Note: You do not need to use the codecs library if you're using Python 3, but you have tagged your question with python-2.7.)

You can see the advantages and disadvantages of the different error handlers here and here. Be aware that if you do not specify an error handler, the default is errors="strict", which you probably do not want. The other options are nearly self-explanatory, e.g.:

  • errors="replace" replaces each undecodable byte with a replacement marker (U+FFFD, shown as � when decoding)
  • errors="ignore" silently drops the undecodable byte and continues reading the file data.

Which one you should use depends on your needs and use case(s).
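To make the difference between the three handlers concrete, here is a small sketch that decodes the same bytes three ways. The sample bytes are made up for illustration: they are "café au lait" encoded as Latin-1, so the byte 0xE9 is not valid UTF-8.

```python
raw = b"caf\xe9 au lait"  # Latin-1 bytes, deliberately not valid UTF-8

# errors="strict" is the default and raises on the first bad byte.
try:
    raw.decode("utf-8")
    strict_result = "decoded"
except UnicodeDecodeError as exc:
    strict_result = "error: " + exc.reason

# errors="replace" keeps a visible marker (U+FFFD) where data was lost.
replaced = raw.decode("utf-8", errors="replace")

# errors="ignore" drops the bad byte without leaving any trace.
ignored = raw.decode("utf-8", errors="ignore")

print(strict_result)
print(replaced)
print(ignored)
```

With "replace" you can still see afterwards that something was lost, while "ignore" silently shortens the text. For a comparison report, that may change which words match.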

You say that you have encoding problems not only with plain text files, but also with proprietary doc files:

The .doc format is not a plain text format that you can simply read with open() or codecs.open(), since much of its content is stored in binary form; see this site for more information. So you need a special reader for .doc files to get the text out of them. Which library you can use depends on your Python version and maybe also on your operating system. Maybe here is a good starting point for you.

Unfortunately, using a library does not necessarily protect you from encoding errors. (It might, but I'm not sure whether the encoding is stored in the file itself, as it is in a .docx file.) You may also be able to figure out the encoding of the file. How encoding errors are handled likely depends on the library itself.

So my guess is that you are trying to open .doc files as plain text files. You will then get decoding errors, because the content is not stored as human-readable text. And even if you get rid of the error, you will only see non-human-readable text. (I've created a simple text document with LibreOffice in doc format (Microsoft Word 1997-2003)):

In [1]: open("./test.doc", "r").read()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte

In [2]: open("./test.doc", "r", errors="replace").read()  # or open("./test.doc", "rb").read()
'��\x11\u0871\x1a�\x00\x00\x00' ...
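The byte 0xd0 in the session above is the first byte of the 8-byte signature that legacy .doc files (OLE2 compound files) start with, whereas .docx files are ZIP archives and start with "PK". A rough sketch of checking those leading bytes before deciding how to read an upload follows. The function name `sniff_container` and its return values are made up for illustration; a match on the signature only identifies the container format, it does not prove the file is a Word document.

```python
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy Office files (.doc, .xls, ...)
ZIP_MAGIC = b"PK\x03\x04"                          # ZIP archives, including .docx


def sniff_container(path):
    """Hypothetical helper: classify a file by its first bytes."""
    # Binary mode: no decoding happens, so no UnicodeDecodeError is possible.
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

Only files reported as "unknown" would then be candidates for reading as plain text; the others need a dedicated reader library.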
colidyre
  • Thank you so much, friend. Now I have some ways that I can try. I thought reading `doc` files would not be such a hard thing in 2020, but what can we do. – Bob Reynolds Apr 06 '20 at 06:17
  • Man, hello, have a good day. Can you check please my this question https://stackoverflow.com/questions/61156491/django-form-save-is-not-creating-modelform ? Last days my question reach very few people, and can't tag someone. Please, help me. – Bob Reynolds Apr 11 '20 at 12:40