3

I'm writing a short Python script to 'manually' convert a CSV file into valid JSON. I read the input file with readlines() to get a list of rows, manipulate the rows and concatenate them into a single string, and then write that string to a separate .txt file. The output, however, contains gibberish instead of the Hebrew characters that were present in the input file, and it is double-spaced horizontally (a whitespace character is added after each character). As far as I can tell, the problem has to do with the encoding, but I haven't been able to figure out what exactly. When I check the encoding of the input and output file objects (using the .encoding attribute), both return None, which means they use the system default. Technical details: Python 2.7, Windows 7.

While there are a number of questions out there on this topic, I didn't find a direct answer to my problem. Detecting the system defaults won't help me in this case, because I need the program to be portable.

Here's the code:

def txt_to_JSON(csv_list):
    ...some manipulation of the list...
    return JSON_string
file_name = "input_file.txt"
my_file = open(file_name)
# make each line of input file a value in a list
lines = my_file.readlines()
# break up each line into a list such that each 'column' is a value in that list 
for i in range(0,len(lines)):
    lines[i] = lines[i].split("\t")
J_string = txt_to_JSON(lines)
json_file = open("output_file.txt", "w+")
json_file.write(J_string)
json_file.close()
ygesher
  • It's worth noting that when working with files in Python, it's best to use [the `with` statement](http://www.youtube.com/watch?v=lRaKmobSXF4). – Gareth Latty Apr 24 '13 at 14:48
  • Do you know what's the encoding of the input file? – Paulo Bu Apr 24 '13 at 14:53
  • @PauloBu He's reading Hebrew characters, but he's using ASCII in his program. This is most likely the problem. – Aleph Apr 24 '13 at 15:05
  • What version of Python? – Burhan Khalid Apr 24 '13 at 15:08
  • In general, Python assumes ASCII, so you have to specify the input encoding and the output encoding when you're working with files in some other encoding. (That sounds a little funny :D) – Paulo Bu Apr 24 '13 at 15:11
  • @PauloBu: I don't know the input encoding, and as I noted (perhaps my edit came after your comment), I need it to be portable. – ygesher Apr 24 '13 at 15:44
  • If I `print` the strings I'm working with, it comes out as Unicode. That is, it's already encoded, but I don't know the encoding. If it's a matter of detecting the encoding, I understand that's a tricky and uncertain business, particularly since this program could be used on various platforms, etc... – ygesher Apr 24 '13 at 15:47
  • Open the input file with `notepad`, choose Save As..., and in the encoding box at the bottom of the pop-up window choose UTF-8, then save the file. Now you know your input file is UTF-8 (it should keep the Hebrew characters intact), so try to run the whole process again with that input. If it doesn't work, please add a brief sample of the input file so I can try to parse it here. I also have Windows/Python 2.7. – Paulo Bu Apr 24 '13 at 15:55
  • @PauloBu Again, I want this to work on any system. Also, the directions I was given for testing the program were to use a file saved as unicode, not UTF-8. – ygesher Apr 24 '13 at 15:59
  • @jeg622 Unicode is a character set rather than a single byte encoding, and UTF-8 is the most widely used encoding of it. That's why I'm suggesting that you save the file in UTF-8. To get this working on all systems, you first have to get it working on at least one. We will write the code without system-specific instructions, but first we have to see what the problem is. – Paulo Bu Apr 24 '13 at 16:40
  • @PauloBu If I save the input file as UTF-8, it works perfectly! However, I was instructed to use a file saved as unicode. I will question my team leader about that instruction and get back to you. – ygesher Apr 24 '13 at 17:03
  • I'm glad. If you want some background to explain this to your leader, these links will be very helpful, especially the first: http://www.joelonsoftware.com/articles/Unicode.html , http://stackoverflow.com/questions/3951722/whats-the-difference-between-unicode-and-utf8 and http://stackoverflow.com/questions/643694/utf-8-vs-unicode – Paulo Bu Apr 24 '13 at 17:18

2 Answers

1

All data needs to be encoded to be stored on disk. If you don't know the encoding, the best you can do is guess. There's a library for that: https://pypi.python.org/pypi/chardet
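Before falling back on statistical guessing, it can be worth checking for a byte-order mark (BOM), since files saved by Notepad as "Unicode" or "UTF-8" normally start with one. The following is only a minimal sketch of that idea using the standard `codecs` module; the function name `guess_encoding` and its `fallback` parameter are made up for illustration, and a file without a BOM still has to be guessed some other way.

```python
import codecs

# BOMs to look for. The UTF-32 entries come first because the
# UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(raw_bytes, fallback="utf-8"):
    """Return a codec name based on a leading BOM (hypothetical helper).

    The returned codecs ("utf-16", "utf-32", "utf-8-sig") consume the BOM
    themselves when decoding. `fallback` is an assumption, not a detection.
    """
    for bom, name in _BOMS:
        if raw_bytes.startswith(bom):
            return name
    return fallback
```

Typical use would be to read the file in binary mode (`'rb'`), pass the bytes to `guess_encoding`, and then call `.decode()` with the returned name.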

I highly recommend Ned Batchelder's presentation http://nedbatchelder.com/text/unipain.html for details.

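There's an explanation of the use of "unicode" as the name of an encoding on Windows in the question "What's the difference between Unicode and UTF-8?" (http://stackoverflow.com/questions/3951722/whats-the-difference-between-unicode-and-utf8).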

TLDR: Microsoft uses UTF-16 as the encoding for Unicode strings, but decided to call it "unicode" because they also use it internally.
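This also accounts for the "double-spaced" output described in the question. A small sketch, assuming the UTF-16 bytes are misread with a one-byte-per-character encoding (Latin-1 is used here only as a stand-in for whatever "ANSI" code page the system defaults to):

```python
# Two ASCII letters followed by the Hebrew letter alef (U+05D0).
text = u"ab\u05d0"

# UTF-16 stores each of these characters in two bytes.
utf16_bytes = text.encode("utf-16-le")

# Misreading those bytes one byte at a time: every ASCII letter is followed
# by a NUL character (often displayed as a blank), and the Hebrew letter
# turns into two unrelated characters.
misread = utf16_bytes.decode("latin-1")

assert utf16_bytes == b"a\x00b\x00\xd0\x05"
assert misread == u"a\x00b\x00\xd0\x05"
```

Decoding the same bytes with the matching codec (`utf16_bytes.decode("utf-16-le")`) returns the original three characters.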

Even if Python 2 is a bit lenient about str/unicode conversions, you should get used to always decoding on input and encoding on output.

In your case

filename = 'where your data lives'
with open(filename, 'rb') as f:
    encoded_data = f.read()
decoded_data = encoded_data.decode("UTF16")

# do stuff, resulting in result (all on unicode strings)
result = text_to_json(decoded_data)

encoded_result = result.encode("UTF-16")  #really, just using UTF8 for everything makes things a lot easier
outfile = 'where your data goes'
with open(outfile, 'wb') as f:
    f.write(encoded_result)
Thomas Fenzl
  • Thanks for the input. When I do this, however, the output file (created by `f.write()`) is still encoded as ANSI, so I get UnicodeEncodeError when it gets to the Hebrew characters. And btw, utf_16 is the proper notation. – ygesher Apr 25 '13 at 06:29
  • Following your link, I changed the encoding from 'utf_16' to 'utf_16_le', and got a similar error, just relating to very beginning of the file rather than the non-ascii characters. – ygesher Apr 25 '13 at 06:35
  • what program do you use to open the output file? – Thomas Fenzl Apr 25 '13 at 10:42
  • I use notepad. How would this affect the encoding? – ygesher Apr 25 '13 at 13:06
  • the program has to decode the file to interpret what's in it. Can you put both files, or similar files with nonsense original text, someplace? I'd like to take a look – Thomas Fenzl Apr 25 '13 at 16:05
  • After some additional playing around (just in order to get a better understanding of what was going on), I reached additional conclusions: 1. Saving the file as 'unicode' and using `codecs.open()` with `utf_16` or `utf_16_le` as encoding produced a result similar to saving as 'utf-8' and opening it with `open()`, the sole difference being that when I saved it as 'unicode' the output file did not have any line breaks 2. Using `codecs.open()` is fundamentally different from `str.encode()`, but I don't really understand why. – ygesher May 01 '13 at 06:43
  • For the record, [this](https://docs.google.com/file/d/0B-oE2XIgVJv2bmhkREJkU0pMQ0U/edit?usp=sharing) is the input file. The gibberish file can be found [here](https://docs.google.com/file/d/0B-oE2XIgVJv2QkVTRWkxTTlCNDg/edit?usp=sharing), although on my computer the gibberish looked different, and the whole file was double-spaced. The correct file looks like [this](https://docs.google.com/file/d/0B-oE2XIgVJv2YmtaVDdObzhDWE0/edit?usp=sharing). For some reason all the tabs disappeared when I uploaded the files to Drive. – ygesher May 01 '13 at 06:48
  • `codecs.open` with an encoding decodes data from the file while reading and encodes it while writing, so inside your code you only have unicode strings. It's equivalent to `f = open('filename', 'r+'); s = f.read().decode(encoding)` and later `f.write(s.encode(encoding))`, so `u"asdf".encode` is only half the functionality. – Thomas Fenzl May 01 '13 at 10:36
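The decode-on-read, encode-on-write behaviour described in the last comment is also available through `io.open`, which exists in Python 2.6+ and Python 3. The sketch below is an illustration rather than the asker's actual program: `convert_file` and `rows_to_text` are hypothetical names, and `rows_to_text` is only a stand-in for the elided `txt_to_JSON` step. The default encodings are assumptions as well (Notepad's "Unicode" for input, UTF-8 for output).

```python
import io


def rows_to_text(rows):
    # Hypothetical placeholder for the real conversion step:
    # joins the columns of each row with ", " and the rows with newlines.
    return u"\n".join(u", ".join(row) for row in rows) + u"\n"


def convert_file(in_path, out_path, in_encoding="utf-16", out_encoding="utf-8"):
    # io.open decodes while reading, so each line is a unicode string.
    with io.open(in_path, "r", encoding=in_encoding) as src:
        rows = [line.rstrip(u"\n").split(u"\t") for line in src]

    result = rows_to_text(rows)

    # io.open encodes while writing; in text mode it also translates
    # "\n" to the platform's line separator.
    with io.open(out_path, "w", encoding=out_encoding) as dst:
        dst.write(result)
```

One difference from `codecs.open` is worth noting: `codecs.open` always opens the underlying file in binary mode, so it performs no newline translation. That may be why the output written through `codecs.open` appeared to have no line breaks in Notepad.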
0

You need to tell Python to use the Unicode character encoding to decode the Hebrew characters. Here's a link to how you can read Unicode characters in Python: Character reading from file in Python

Community
  • Sorry, I didn't find a solution there. I tried using the `codecs` module, but nothing changed in the output. – ygesher Apr 24 '13 at 15:56