1

I've created a program that takes data from .txt files and uses a series of regular expressions to turn it into useful information that can go into a pandas DataFrame. While testing, I simply copied and pasted bits of the data from the .txt files into Python variables rather than loading the entire files. Now that I've finished testing, I can't figure out how to load the .txt files in a useful way.

I tried in both Google Colab and Jupyter Notebook. Here is the code for JN:

file1 = open("sample_file.txt","r")
file1.readlines()

Unfortunately, I get output that looks like gibberish (though it might be hexadecimal):

'ÿþG\x00a\x00m\x00e\x00 \x00s\x00t\x00a\x00r\x00t\x00e\x00d\x00 \x00a\x00t\x00:\x00 \x002\x000\x001\x008\x00/\x007\x00/\x002\x001\x00 \x006\x00:\x003\x003\x00:\x001\x004\x00\n'

How do I fix this and make it readable so my program will run on it?

Martijn Pieters
Ragnar Lothbrok
  • At a guess, the file is encoded as UTF-16, but the fragment that you have provided isn't valid UTF-16 (perhaps you have truncated a larger fragment?). So try passing `open` an `encoding` argument of `'utf-16-le'` or `'utf-16-be'`. – snakecharmerb Jul 22 '18 at 09:13
  • 'utf-16-le' worked. Thanks! I copied just the first line, so maybe I truncated it. – Ragnar Lothbrok Jul 22 '18 at 09:46

2 Answers

3

The data in the file is encoded as UTF-16, little-endian. The fact that every other byte is a null byte (\x00) is a strong hint that some variant of UTF-16 is involved.
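To see where the null bytes and the leading 'ÿþ' come from, here is a minimal sketch. The sample word is taken from the output in the question; the choice of Latin-1 for the misreading is an assumption, since the default encoding on the asker's machine isn't stated.

```python
import codecs

# Hypothetical sample: the first word of the line shown in the question.
word = "Game"

# UTF-16-LE stores each of these ASCII characters as the character byte
# followed by a null byte.
encoded = word.encode("utf-16-le")
print(encoded)  # b'G\x00a\x00m\x00e\x00'

# A UTF-16 file written with a byte order mark (BOM) starts with FF FE
# when the byte order is little-endian.
raw = codecs.BOM_UTF16_LE + encoded

# Decoding those bytes with a single-byte encoding (Latin-1 here, as an
# assumption about what happened) shows the BOM as 'ÿþ' and keeps the nulls.
misread = raw.decode("latin-1")
print(repr(misread))  # 'ÿþG\x00a\x00m\x00e\x00'
```

The 'ÿþ' prefix is therefore a useful clue on its own: it is what the little-endian UTF-16 BOM looks like when read one byte at a time.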

Pass the open function an encoding argument to decode the data.

with open("sample_file.txt", "r", encoding="utf-16-le") as file1:
    lines = file1.readlines()
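One detail that may matter if regular expressions are anchored to the start of a line: when the file begins with a BOM, `'utf-16-le'` keeps it as a leading U+FEFF character, whereas `'utf-16'` uses the BOM to pick the byte order and then drops it. A small sketch, using a temporary file and sample text modeled on the question (both are stand-ins, not the asker's real data):

```python
import os
import tempfile

# Hypothetical sample content, modeled on the first line shown in the question.
text = "Game started at: 2018/7/21 6:33:14\n"

# Write a stand-in file: a BOM (FF FE) followed by little-endian UTF-16.
path = os.path.join(tempfile.mkdtemp(), "sample_file.txt")
with open(path, "wb") as f:
    f.write(b"\xff\xfe" + text.encode("utf-16-le"))

# "utf-16-le" decodes the text but keeps the BOM as a leading U+FEFF.
with open(path, "r", encoding="utf-16-le") as f:
    lines_le = f.readlines()

# "utf-16" reads the BOM to choose the byte order, then discards it.
with open(path, "r", encoding="utf-16") as f:
    lines_auto = f.readlines()

print(repr(lines_le[0]))
print(repr(lines_auto[0]))
```

If the first line comes back with a stray `'\ufeff'` at the front, switching to `encoding="utf-16"` is probably the simpler fix.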
snakecharmerb
2

Try adding an encoding argument to your open call:

open(Filename, 'r', encoding='utf-8')

See more here: https://docs.python.org/3/library/functions.html#open

(Copied from this response: Unicode (UTF-8) reading and writing to files in Python)
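If the encoding isn't known in advance, one option is to try a few candidates in order. This is only a sketch: `read_text_with_fallback` is a hypothetical helper, not a standard-library function, and the candidate list is an assumption. It works for the file in the question because the leading FF byte is not valid UTF-8, but UTF-16 text without a BOM could still decode "successfully" as UTF-8 with embedded nulls, so don't treat it as reliable detection.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "utf-16")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper for illustration only.
    """
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc) as f:
                return f.read(), enc
        except UnicodeError:
            # This candidate could not decode the bytes; try the next one.
            continue
    raise ValueError(f"none of {encodings} could decode {path!r}")


# Stand-in file: BOM plus little-endian UTF-16, as in the question.
sample = "Game started at: 2018/7/21 6:33:14\n"
sample_path = os.path.join(tempfile.mkdtemp(), "sample_file.txt")
with open(sample_path, "wb") as f:
    f.write(b"\xff\xfe" + sample.encode("utf-16-le"))

decoded, used = read_text_with_fallback(sample_path)
print(used, repr(decoded))
```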