I have a JSON file with several keys. I want to take one of the keys and write that string to a file. The string is originally unicode, so I do s.encode('utf-8').
Now, there is another key in that JSON which I write to a second file (this is a machine learning task: I write the original string to one file and the features to another). The problem is that at the end, the file with the unicode string turns out to have more lines (when counted with "wc -l"), and this misleads my tool, which crashes saying the sizes are not the same.
Code for reference:
import json

for line in input_file:
    j = json.loads(line)
    text = j['text']    # the original unicode string
    label = j[t]        # t holds the label key (set elsewhere)
    output_file.write(str(label) + '\t' + text.encode('utf-8') + '\n')
    norm_file.write(j['normalized'].encode('utf-8') + '\n')
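For what it's worth, here is a minimal sketch of what I suspect is going on (the JSON line and file name are made up, not from my real data): if the decoded text itself contains an embedded newline, a single write() ends up producing two physical lines as far as "wc -l" is concerned.

import json

# Hypothetical record: a 'text' value with an embedded newline ("\n" in the JSON)
line = '{"text": "first half\\nsecond half", "normalized": "first half second half"}'
j = json.loads(line)
with open('demo.txt', 'w') as f:
    # j['text'] decodes to u'first half\nsecond half'; the embedded '\n'
    # plus the one appended here turns one record into two lines
    f.write(j['text'].encode('utf-8') + '\n')
# "wc -l demo.txt" now reports 2, even though only one record was written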
The difference when using "wc -l": the number of lines I expect is 16862965, but what I get is 16878681, which is higher. So I write a script to see how many output labels are actually there:
import sys

c = 0
with open(sys.argv[1]) as input_file:
    for line in input_file:
        p = line.split('\t')
        if p[0] not in ("good", "bad"):
            print p    # not a valid label line; show it
        else:
            c += 1
print c
And, lo and behold, I get 16862965 valid labels, which matches the expected count, so the extra lines must be wrong. I print them out and get a bunch of empty lines (just '\n'). So I guess my question is, "what am I missing when dealing with unicode like this?"
Should I have stripped all leading and trailing whitespace (not that there is any in the string)?
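In case it helps, this is the kind of sanitizing I am considering (just a sketch, assuming embedded newlines in the text are the culprit; strip() alone would only touch the ends, not whitespace in the middle):

# Collapse every run of whitespace (including embedded '\n') to a single
# space, so each record stays on exactly one physical line
clean_text = u' '.join(text.split())
output_file.write(str(label) + '\t' + clean_text.encode('utf-8') + '\n')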