
I have a hundred files and according to chardet each file is encoded with one of the following:

['UTF-8', 'ascii', 'ISO-8859-2', 'UTF-16LE', 'TIS-620', 'utf-8', 'SHIFT_JIS', 'ISO-8859-7']

So I know each file's encoding, and therefore what encoding to open it with.

I wish to convert all files to ascii only. I also wish to convert different versions of characters like - and ' to their plain ascii equivalents. For example, b"\xe2\x80\x94".decode("utf8") should be converted to -. The most important thing is that the text stays easy to read: I don't want don t, for example, but rather don't.
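(For reference, here is a quick check of what that byte sequence actually is:)

```python
# The UTF-8 bytes from the example decode to U+2014, the em dash.
ch = b"\xe2\x80\x94".decode("utf8")
print(repr(ch))
```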

How might I do this?

I can use either Python 2 or 3 to solve this.

This is as far as I got with Python 2. I'm trying to detect the lines which contain non-ascii characters to begin with.

for file_name in os.listdir('.'):
    print(file_name)
    r = chardet.detect(open(file_name).read())
    charenc = r['encoding']
    with open(file_name, "r") as f:
        for line in f.readlines():
            if line.decode(charenc) != line.decode("ascii", "ignore"):
                print(line.decode("ascii", "ignore"))

This gives me the following exception:

    if line.decode(charenc) != line.decode("ascii","ignore"):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 6: truncated data
    It looks like you want us to write some code for you. While many users are willing to produce code for a coder in distress, they usually only help when the poster has already tried to solve the problem on their own. A good way to demonstrate this effort is to include the code you've written so far, example input (if there is any), the expected output, and the output you actually get (console output, stack traces, compiler errors - whatever is applicable). The more detail you provide, the more answers you are likely to receive. – Martijn Pieters Oct 28 '13 at 20:31
  • Why do you want to convert these to ASCII? You'll lose accents and all that cool stuff! And as Martijn said, show us you tried and aren't just looking for someone to do your work – ThinkChaos Oct 28 '13 at 20:33
  • @Martijn Pieters I've now updated my answer – Baz Oct 28 '13 at 21:03
  • @plg I don't want the cool stuff! – Baz Oct 28 '13 at 21:10

1 Answer


Don't use .readlines() on a binary file with a multi-byte encoding. In UTF-16, little-endian, a newline is encoded as two bytes: 0A (a newline in ASCII) followed by 00 (a NUL). .readlines() splits on the first of those two bytes, leaving you with incomplete data to decode.
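You can see the two-byte newline for yourself in an interactive session:

```python
# Each character in UTF-16-LE takes (at least) two bytes; the trailing
# newline becomes the pair 0A 00. Splitting the raw bytes on 0A alone
# strands the 00 at the start of the "next line".
encoded = u"hi\n".encode("utf-16-le")
print(repr(encoded))
```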

Reopen the file with the io library for ease of decoding:

import io

for file_name in os.listdir('.'):
    print(file_name)
    r = chardet.detect(open(file_name, 'rb').read())
    charenc = r['encoding']
    with io.open(file_name, "r", encoding=charenc) as f:
        for line in f:
            line = line.encode("ascii", "ignore")
            print(line)

To replace specific unicode codepoints with ASCII-friendly characters, use a dictionary mapping codepoint to codepoint or unicode string and call line.translate() first:

charmap = {
    0x2014: u'-',   # em dash
    0x201D: u'"',   # comma quotation mark, double
    # etc.
}

line = line.translate(charmap)

Here I used hexadecimal integer literals to define the unicode codepoints to map from. The value in the dictionary must be a unicode string, an integer (a codepoint), or None to delete that codepoint altogether.
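A quick sanity check of the mapping, using the don't example from your question (the 0x2019 entry for the curly apostrophe is my addition; extend the table with whatever codepoints your files actually contain):

```python
charmap = {
    0x2014: u'-',    # em dash
    0x2019: u"'",    # right single quotation mark (curly apostrophe)
    0x201D: u'"',    # right double quotation mark
}

# unicode/str .translate() accepts a dict keyed by codepoint.
line = u"don\u2019t \u2014 ok"
print(line.translate(charmap))  # don't - ok
```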

  • This is giving me: UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128) on the line with "ignore". charenc = "UTF-8" – Baz Oct 28 '13 at 21:18
  • @Baz: sorry, my mistake; that should have been an *encode()*, not `decode()`. – Martijn Pieters Oct 28 '13 at 21:21
  • open(file_name).read() can be a problem if the file is very big (75GB in my case). open(file_name).readline() will work as well. – JPMagalhaes Apr 04 '18 at 16:25