
I have made a module that detects the encoding of a file. I want to be able to give a file path and an encoding as inputs to the class and always get back 'utf-8' when I process the contents of the file.

For example, something like this:

handler = UnicodeWrapper(file_path, encoding='ISO-8859-2')

for line in handler:
    # need the line to be encoded in utf-8
    process(line)

I still can't understand why there are a million different encodings, but I want to write an interface that always returns unicode.

Is there a library to do this already?

Not exactly, but the `codecs` module gives you wrappers that allow you to read a file into unicode strings, more or less what Python 3's open also directly allows. – Serge Ballesta Jan 05 '17 at 12:46
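
As a rough illustration of what that comment suggests (a minimal sketch, not part of the original question; the file name and encoding below are assumptions), the standard codecs module can decode each line into unicode as you read:

import codecs

# Hypothetical file path and encoding; codecs.open decodes every line it
# yields using the declared source encoding, so the loop sees unicode strings.
with codecs.open('data.txt', 'r', encoding='ISO-8859-2') as f:
    for line in f:
        print(line.encode('utf-8'))  # re-encode only if UTF-8 bytes are needed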

1 Answer


Based on this answer, I think the following might suit your needs:

import io

class UnicodeWrapper(object):
    def __init__(self, filename):
        self._filename = filename

    def __iter__(self):
        # readlines() reads the whole file into memory before the file is
        # closed, so the iterator returned here stays valid afterwards
        with io.open(self._filename, 'r', encoding='utf8') as f:
            return iter(f.readlines())

if __name__ == '__main__':
    filename = r'...'

    handler = UnicodeWrapper(filename)

    for line in handler:
       print(line)
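
If the source file is not UTF-8, a variant of the class above could accept the source encoding, matching the call from the question, and still hand back text you can encode as UTF-8. This is only a sketch; the encoding parameter and its default are assumptions, not part of the original answer:

import io

class UnicodeWrapper(object):
    def __init__(self, filename, encoding='utf-8'):
        # encoding is whatever you detected for the file, e.g. 'ISO-8859-2'
        self._filename = filename
        self._encoding = encoding

    def __iter__(self):
        # Decode with the declared source encoding; every yielded line is a
        # unicode string, which line.encode('utf-8') turns into UTF-8 bytes.
        with io.open(self._filename, 'r', encoding=self._encoding) as f:
            for line in f:
                yield line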

Edit

In Python 2, you can check that each line is valid UTF-8 using something like this:

if __name__ == '__main__':
    filename = r'...'

    handler = UnicodeWrapper(filename)

    for line in handler:
        try:
            # assumes `line` is a byte string; decoding it checks it is valid UTF-8
            line.decode('utf-8')
            # process(line)
        except UnicodeDecodeError:
            print('Not encoded in UTF-8')
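
Note that when the lines come from io.open, as in the class above, they are already unicode objects in Python 2, so decoding them again is unnecessary. A possible alternative, assuming process() wants UTF-8 byte strings, is to encode on the way out:

if __name__ == '__main__':
    filename = r'...'

    handler = UnicodeWrapper(filename)

    for line in handler:
        # io.open has already decoded the line; this just produces UTF-8 bytes
        utf8_line = line.encode('utf-8')
        # process(utf8_line)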