19

I'm trying to read a text file into Python, but it seems to use some very strange encoding. I try the usual:

file = open('data.txt','r')
lines = file.readlines()

for line in lines[0:1]:
    print line,
    print line.split()

Output:

0.0200197   1.97691e-005

['0\x00.\x000\x002\x000\x000\x001\x009\x007\x00', '\x001\x00.\x009\x007\x006\x009\x001\x00e\x00-\x000\x000\x005\x00']

Printing the line works fine, but when I split the line so that I can convert the pieces to floats, the result looks crazy, and converting those strings to floats produces an error. Any idea how I can convert these back into numbers?

I put the sample datafile here if you would like to try to load it: https://dl.dropboxusercontent.com/u/3816350/Posts/data.txt

I would like to simply use numpy.loadtxt or numpy.genfromtxt, but they also do not want to deal with this crazy file.

DanHickstein
  • As a side note, there's almost never a good reason to call `readlines()`. The file itself is already an iterable, so you can just write `for line in file:`. In your case, you're slicing, which won't work on the file object... but you're doing it to just read the first line, so `line=next(file)` will work. Using `readlines` forces Python to read the entire file into memory and build a list, wasting time and memory. – abarnert Oct 12 '13 at 00:11
  • Good idea, thanks! I am not used to using readlines(), or iterating over the lines in a file since I prefer to use numpy.loadtxt for loading files like this. Do you think it can handle the crazy encoding? – DanHickstein Oct 12 '13 at 00:12
  • 1
    I believe numpy.loadtxt doesn't have an encoding parameter, but it can take a file-like object like the one io.open or codecs.open will return. However, it may not like Unicode files, so you may have to "transcode" it to ASCII, which means basically putting _two_ wrappers around it--one to decode the UTF-16, the other to encode the result to ASCII. I'll look at the docs and do a test when I get home – abarnert Oct 12 '13 at 00:16
  • Oh, I had no idea you could pass the file object (opened with utf-16-le encoding) to numpy.loadtxt()! I checked it, it works. This is the perfect solution. – DanHickstein Oct 12 '13 at 00:20
  • If that doesn't work (although it sounds like it does) and you're in a hurry: `(line.encode('ascii') for line in file)` (with file being the result of `io.open`) should be acceptable input to `loadtxt`. – abarnert Oct 12 '13 at 00:26
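
To make the transcoding idea from these comments concrete, here is a minimal sketch (assuming Python 2 with numpy, and the sample file's first eight rows skipped as in the answer further down):

import io
import numpy as np

# io.open decodes the UTF-16-LE bytes to unicode; the generator
# re-encodes each line to ASCII before numpy ever sees it.
with io.open('data.txt', 'r', encoding='utf-16-le') as f:
    data = np.loadtxt((line.encode('ascii') for line in f), skiprows=8)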

4 Answers

32

I'm willing to bet this is a UTF-16-LE file, and you're reading it as whatever your default encoding is.

In UTF-16, each character takes two bytes.* If your characters are all ASCII, this means the UTF-16 encoding looks like the ASCII encoding with an extra '\x00' after each character.
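
For instance, encoding a short ASCII string as UTF-16-LE in the interpreter shows the interleaved null bytes:

>>> u'0.02'.encode('utf-16-le')
'0\x00.\x000\x002\x00'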

To fix this, just decode the data:

print line.decode('utf-16-le').split()

Or do the same thing at the file level with the io or codecs module:

import io

file = io.open('data.txt', 'r', encoding='utf-16-le')
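
Putting it together, a minimal sketch (assuming whitespace-separated numeric columns, as in the question; skip any header lines first):

import io

with io.open('data.txt', 'r', encoding='utf-16-le') as f:
    for line in f:
        # each line is already decoded unicode, so split and float just work
        print [float(x) for x in line.split()]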

* This is a bit of an oversimplification: Each BMP character takes two bytes; each non-BMP character is turned into a surrogate pair, with each of the two surrogates taking two bytes. But you probably didn't care about these details.

abarnert
  • This produced some fairly strange output: [u'\u300d\u2e00', u'\u3200', u'\u3100\u3900\u3700\u0900\u3100\u2e00\u3900\u3700\u3600\u3900\u3100\u6500\u2d00', u'\u3500\u0a00'] Maybe you can try to load the file from the link I included? – DanHickstein Oct 11 '13 at 23:55
  • @PeterDeGlopper: Decoding before even looping would be better. I think `io.open('data.txt', encoding='utf-16-le')` will take care of that, but without a computer in front of me I can't verify that, so I left the details out of the answer. – abarnert Oct 11 '13 at 23:56
  • @DanHickstein: oops, looks like this is UTF-16-BE then. Try changing the l to a b. – abarnert Oct 11 '13 at 23:57
  • Using io.open('data.txt', encoding='utf-16-le') did the trick! Thanks for your help! (Maybe add this to your answer and then I'll accept it.) – DanHickstein Oct 12 '13 at 00:01
  • 1
    The advantage of io over codecs is that it's forward-compatible to Python 3.x, and has some bug fixes and performance improvements for various edge cases that will never be added to codecs. The disadvantage is that it's not backward compatible to early 2.x versions, and it's not "bug-compatible" with code that relies on the odd edge case behaviors of codecs, and it doesn't provide some uncommon but sometimes useful functions and types that codecs does. – abarnert Oct 12 '13 at 00:13
  • Interesting! codecs.open('data.txt', encoding='utf-16-le') also works great in this case. – DanHickstein Oct 12 '13 at 00:17
  • @DanHickstein: For simple cases, in Python 2.7, they're pretty much interchangeable. – abarnert Oct 12 '13 at 00:22
3

Looks like UTF-16 to me.

>>> test_utf16 = '0\x00.\x000\x002\x000\x000\x001\x009\x007\x00'
>>> test_utf16.decode('utf-16')
u'0.0200197'

You can work directly off the Unicode strings:

>>> float(test_utf16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: null byte in argument for float()
>>> float(test_utf16.decode('utf-16'))
0.020019700000000001

Or encode them to something different, if you prefer:

>>> float(test_utf16.decode('utf-16').encode('ascii'))
0.020019700000000001

Note that you need to do this as early as possible in your processing. As your comment noted, split will behave incorrectly on the UTF-16-encoded form. The UTF-16 representation of the space character ' ' is ' \x00', so split removes the whitespace but leaves the null byte.
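
You can see the damage directly; the second token keeps the null byte left over from the encoded space:

>>> u' '.encode('utf-16-le')
' \x00'
>>> '1\x00 \x002\x00'.split()
['1\x00', '\x002\x00']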

The io library (available in 2.6 and later) can handle this for you, as can the older codecs library. io handles line endings better, so it's preferable if available.

Peter DeGlopper
  • 1
    The second number lost its first byte to `split` but not its second. The utf-16 representation of u' ' is ' \x00', so `split` damages it. Using `decode` before `split` should work better, though now I want to test whether `readlines` works correctly or not. – Peter DeGlopper Oct 11 '13 at 23:55
  • Good catch on split (and probably readlines) leaving behind an extra nul byte for every second string, meaning decoding too late is too late to help. – abarnert Oct 12 '13 at 00:24
1

This is really just @abarnert's suggestion, but I wanted to post it as an answer since this is the simplest solution and the one that I ended up using:

    import io
    import numpy as np

    file = io.open(filename, 'r', encoding='utf-16-le')
    data = np.loadtxt(file, skiprows=8)

This demonstrates how you can create a file object with io.open, specifying whatever crazy encoding your file happens to have, and then pass that file object to np.loadtxt (or np.genfromtxt) for quick-and-easy loading.

DanHickstein
0

This piece of code will do what's necessary:

file_handle = open(file_name, 'rb')
file_first_line = file_handle.readline()
file_handle.close()
print file_first_line
if '\x00' in file_first_line:
    file_first_line = file_first_line.replace('\x00', '')
    print file_first_line

If you call file_first_line.split() before replacing, the output still contains '\x00'. I just tried replacing '\x00' with an empty string, and it worked.
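
For example, continuing the snippet above (a sketch; it assumes the first line contains only the two numbers, as in the question):

numbers = [float(x) for x in file_first_line.split()]
print numbers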

oliver smith