
I have a piece of Python code that reads from a txt file correctly, but my colleague gave me another set of files that also appear to be txt files. When I run the same Python code on them, each line is read incorrectly. For the new files, if the line is 240,022414114120,-500,Bauer_HS5,0, it is read as str:2[]4[]0 []0[]2[]2[]4..... All those little rectangles between the characters, and the leading question marks, are invalid characters. It then gets further converted to something like this: [['\xff\xfe2\x004\x000\x00', '\x000\x002\x002\x004\x001\x004\x001\x001\x004\x001\x002\x000\x00', '\x00-\x005\x000\x000\x00',...... However, if I manually create a normal text file and copy/paste the content from the input file, the parser reads each line correctly. So I suspect the input files are a different type from a normal text file, even though their suffix is indeed 'txt'.

The files come from a device that regularly sends files to our server. This parser works fine for another device that does the same thing. And the files from both devices are all of type 'txt'.

Each line is read with:

    for line in self._infile.xreadlines():

I am very confused why it behaves this way. My Python code follows.

def __init__(self, infile=sys.stdin, outfile=sys.stdout):
    if isinstance(infile, basestring):
        infile = open(infile)
    if isinstance(outfile, basestring):
        outfile = open(outfile, "w")

    self._infile = infile
    self._outfile = outfile

def sort(self):
    lines = []
    last_second = None

    for line in self._infile.xreadlines():
        line = line.replace('\r\n', '')
        fields = line.split(',')
        if len(fields) < 2:
            continue
        second = fields[1]
        if last_second and second != last_second:
            lines = sorted(lines, self._sort_lines)
            self._outfile.write("".join([','.join(x) for x in lines]))
            #self._outfile.write("\r\n")
            lines = []

        last_second = second
        lines.append(fields)

    if lines:
        lines = sorted(lines, self._sort_lines)
        self._outfile.write("".join([','.join(x) for x in lines]))
        #self._outfile.write("\r\n")

    self._infile.close()
    self._outfile.close()
user3216886
  • It's an encoding issue. Probably UTF-8 – amccormack Mar 03 '14 at 00:29
  • Where did this encoding issue happen? The files come from a device that regularly sent files to our server. This parser works fine for another device that does the same thing. And the files from both devices are all of type 'txt'. – user3216886 Mar 03 '14 at 00:35
  • The 'txt' thing is just a part of the name of the file. It does not matter to Python. Could you provide an example line from the file that fails? Also, it seems like most of the lines in your example are not actually related to the problem. It is easier to help if you trim your example down to a minimum. – amaurea Mar 03 '14 at 00:37
  • One of the examples is like this: 240,022414114120,-500,Bauer_HS5,0. Then when python reads it, it becomes ??2[]4[]0 []0[]2[]2[]4..... All those little rectangles between each character and the leading question mark characters are all invalid characters. I don't think it's the content. Because if I copy paste the content from the new file to a manually created txt file, python reads it properly. – user3216886 Mar 03 '14 at 00:39
  • Bitrot is theoretically impossible, but files do become corrupt, can you have the file sent to you again? – Russia Must Remove Putin Mar 03 '14 at 01:05

1 Answer


The start of the file you described as coming from your colleague is "\xff\xfe". These two characters make up a "byte order mark" that indicates that the file is encoded with the "UTF-16-LE" encoding (that is, 16-bit Unicode with the lower byte first). Your Python script is reading with an 8-bit encoding (probably whatever your system's default encoding is), so you're seeing lots of extra null characters (the high bytes of the 16-bit characters).
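If you want to confirm this before touching the parser, here is a quick sketch that peeks at the first two bytes of the file and reports any UTF-16 byte order mark (the function name and filename argument are illustrative, not part of the original code):

```python
import codecs

def detect_utf16_bom(filename):
    """Return the UTF-16 variant indicated by a byte order mark, or None."""
    with open(filename, "rb") as f:
        head = f.read(2)
    if head == codecs.BOM_UTF16_LE:  # b"\xff\xfe" -> little-endian
        return "UTF-16-LE"
    if head == codecs.BOM_UTF16_BE:  # b"\xfe\xff" -> big-endian
        return "UTF-16-BE"
    return None
```

Running this on one of the problem files should report "UTF-16-LE", while the files from the working device should report None.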

I can't speak to how the file got a different encoding. Windows text editors (like notepad.exe) are somewhat notorious for silently reencoding files in unhelpful ways if you're not careful with them, so it may be that your colleague previewed the file in an editor and then saved it before forwarding it on to you.

Anyway, the simplest fix is probably to reencode the file. There are various utilities to do this on various OSs, or you could write your own easily enough. Here's a quick and dirty function to reencode a file in Python (which will hopefully raise an exception if the encoding parameters are wrong, but perhaps not always):

def reencode_file(filename, from_encoding="UTF-16-LE", to_encoding="ascii"):
    with open(filename, "rb") as f:
        in_bytes = f.read() # read bytes

    text = in_bytes.decode(from_encoding) # decode to unicode
    if text.startswith(u"\ufeff"):
        text = text[1:] # drop the byte order mark, which ASCII can't encode

    out_bytes = text.encode(to_encoding) # reencode to new encoding

    with open(filename, "wb") as f:
        f.write(out_bytes) # write back to the file

If the files you get are always going to be encoded in UTF-16, you could change your regular script to decode them automatically. In Python 2.7, I'd suggest using the io module's open function for this (it is the same code that the regular open uses in Python 3). Note, however, that the file object returned won't support the xreadlines method, which has been deprecated for a long time (just iterate over the file directly instead).
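As a minimal sketch of that approach (the helper name and filename are illustrative; your sort method would consume these lines instead of calling xreadlines):

```python
import io

def iter_decoded_lines(filename):
    # The "utf-16" codec consumes the BOM and picks the byte order itself;
    # the file object then yields unicode strings, one per line.
    with io.open(filename, "r", encoding="utf-16") as f:
        for line in f:  # iterate directly; no xreadlines()
            yield line.rstrip(u"\r\n")
```

From there, line.split(',') operates on proper unicode text instead of bytes interleaved with nulls.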

Blckknght
  • Thanks. I think that is exactly what happened. I believe the files will always be encoded in UTF-16. What would be an elegant way to solve this? Decoding the files to UTF-8 as they come in? – user3216886 Mar 03 '14 at 03:32