
I need to check the encoding of a file. My problem is with the encodings that Python's codecs offer. I'm very new to Python, so I may be overlooking something.

I can't understand the following points:

  • When I decode a file that is saved as UTF-8 with BOM, the check tells me that it is utf-8.

  • When I check for iso8859_6, it tells me that the file is not in this format even though it is. If I check for "cp720" instead, it is able to recognize it.

According to this documentation, it should be able to recognize the iso8859_6 format.

I've tried to find an understandable explanation on the web but couldn't find anything.

import codecs

class Format:

    def __init__(self, file_Name):
        self.file_Name = file_Name

    def check_coding(self):
        # Candidate encodings, tried in this order
        encoding_formats = ['iso8859_6', 'utf-8', 'utf-8-sig', 'ascii']

        for ex in encoding_formats:
            try:
                # The with block closes the file even if decoding fails
                with codecs.open(self.file_Name, 'r', encoding=ex) as fh:
                    fh.readlines()
            except UnicodeDecodeError:
                print('The supplied file is not encoded as %s' % ex)
                response = False
            else:
                print('The supplied file has the following encoding: %s' % ex)
                response = True
                break

        return response

The file behind file_Name is UTF-8 with BOM, so it shouldn't tell me it's utf-8.

If the file's format is iso8859_6, it tells me that it is not encoded in this format even though it is.
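Both observations can be reproduced without a file. The following is a minimal sketch, assuming the file contains the sample sentence given in the comments and was saved as UTF-8 with BOM; `can_decode` is a hypothetical helper introduced only for this illustration.

```python
import codecs

# Sample sentence from the comment thread; note that it contains "ü".
text = "Diese Datei ist eine Testdatei für das Python Encoding Modul"

# Assumption: the file on disk is UTF-8 with a leading BOM.
data = codecs.BOM_UTF8 + text.encode("utf-8")

# Plain 'utf-8' does not reject the BOM; it decodes it to U+FEFF.
as_utf8 = data.decode("utf-8")
starts_with_bom = as_utf8.startswith("\ufeff")

# 'utf-8-sig' decodes the same bytes but strips the BOM.
as_sig = data.decode("utf-8-sig")
sig_strips_bom = (as_sig == text)


def can_decode(raw, encoding):
    """Hypothetical helper: True if raw decodes without error."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# The UTF-8 bytes of "ü" include 0xBC, which ISO-8859-6 leaves undefined.
iso_ok = can_decode(data, "iso8859_6")

# "ü" is not part of ISO-8859-6 at all, so the text cannot be saved in it.
try:
    text.encode("iso8859_6")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(starts_with_bom, sig_strips_bom, iso_ok, encodable)
```

Since plain utf-8 accepts a BOM, a successful utf-8 decode says nothing about whether a BOM is present; that has to be checked separately.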

Thierry Lathuille
Maiwand
  • Are you able to share a minimal string which, when encoded and written to file, reproduces these behaviours? Also, note that UTF-8 will always decode ASCII successfully, so if you want to detect ASCII you'll need to put it before UTF-8 in the encodings list. – snakecharmerb Apr 18 '19 at 08:28
  • There's no guarantee that using the wrong codec will cause an exception. And snakecharmerb's note about ASCII also applies to UTF-8-sig. – lenz Apr 18 '19 at 08:42
  • @snakecharmerb Yes the input of the file is this text: Diese Datei ist eine Testdatei für das Python Encoding Modul – Maiwand Apr 18 '19 at 09:08
  • @snakecharmerb Thanks for your note – Maiwand Apr 18 '19 at 09:09
  • If you accept text files that use a character encoding from a limited list, you should only need to check that each file is valid for the one encoding that it uses. If you don't know what that is, you have a process problem that is resulting in data loss. – Tom Blodget Apr 18 '19 at 16:41
  • Possible duplicate of [How to determine the encoding of text?](https://stackoverflow.com/questions/436220/how-to-determine-the-encoding-of-text) – snakecharmerb Apr 19 '19 at 07:11
  • @TomBlodget Thanks for your information. But could you explain what exactly you mean by data loss or the process problem? – Maiwand Apr 24 '19 at 07:50
  • There is no text but encoded text. When you receive the bytes for text, they are meaningless without knowledge of the character encoding. There are various ways this might be communicated. If the bytes are received through an HTTP GET response body, the response's [Content-Type](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type) should say. If the bytes are received through an HTTP POST with an HTML file upload, a different field should have asked the user to supply the character encoding or MIME type (or just tell them to only supply UTF-8 files, for example). – Tom Blodget Apr 24 '19 at 16:02
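The ordering advice from the comments could be sketched roughly as follows. This is only an illustration: `guess_encoding` and its default candidate list are hypothetical names, not part of the question's code or of any library. It tests for the UTF-8 BOM explicitly and tries stricter codecs before more permissive ones.

```python
import codecs


def guess_encoding(raw, candidates=("ascii", "utf-8")):
    """Return the first candidate that decodes raw, or None.

    Hypothetical helper. A successful decode does not prove that the
    bytes were written in that encoding; it only shows they are valid
    for it.
    """
    # Plain 'utf-8' also accepts a BOM, so look at the bytes directly.
    if raw.startswith(codecs.BOM_UTF8):
        candidates = ("utf-8-sig",) + tuple(candidates)
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


print(guess_encoding(b"abc"))
print(guess_encoding("für".encode("utf-8")))
print(guess_encoding(codecs.BOM_UTF8 + "für".encode("utf-8")))
print(guess_encoding("für".encode("latin-1")))
```

ASCII comes before UTF-8 because every ASCII file is also valid UTF-8. A single-byte codec that defines all 256 byte values, such as latin-1, would accept any input and is therefore left out of the default candidates.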
