I have a JSON file with several keys. I want to take one of the keys and write that string to a file. The string is originally unicode, so I do s.encode('utf-8').
Now, there is another key in that JSON which I write to a second file (this is a machine learning task: I write the original string to one file and the features to another). The problem is that at the end, the file with the unicode string turns out to have more lines (when counted with "wc -l"), and this misleads my tool, which crashes saying the sizes are not the same.
Code for reference:
import json

for line in input_file:
    j = json.loads(line)
    text = j['text']    # the original unicode string
    label = j[t]        # t holds the label key (set elsewhere)
    output_file.write(str(label) + '\t' + text.encode('utf-8') + '\n')
    norm_file.write(j['normalized'].encode('utf-8') + '\n')
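For what it's worth, here is a minimal sketch of what I suspect is going on (the JSON line and file name are made up, not from my real data): if the decoded text itself contains an embedded newline, a single write() ends up producing two physical lines as far as "wc -l" is concerned.

import json

# Hypothetical record: a 'text' value with an embedded newline ("\n" in the JSON)
line = '{"text": "first half\\nsecond half", "normalized": "first half second half"}'
j = json.loads(line)
with open('demo.txt', 'w') as f:
    # j['text'] decodes to u'first half\nsecond half'; the embedded '\n'
    # plus the one appended here turns one record into two lines
    f.write(j['text'].encode('utf-8') + '\n')
# "wc -l demo.txt" now reports 2, even though only one record was written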
The difference when using "wc -l": the number of lines I expect is 16862965, but what I get is 16878681, which is higher. So I write a script to see how many output labels are actually there:
import sys

c = 0
with open(sys.argv[1]) as input_file:
    for line in input_file:
        p = line.split('\t')
        if p[0] not in ("good", "bad"):
            print p    # not a valid label line; show it
        else:
            c += 1
print c
And, lo and behold, I get 16862965 valid labels, which matches the expected count, so the extra lines must be wrong. I print them out and get a bunch of empty lines (just '\n'). So I guess my question is, "what am I missing when dealing with unicode like this?"
Should I have stripped all leading and trailing whitespace (not that there is any in the string)?
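In case it helps, this is the kind of sanitizing I am considering (just a sketch, assuming embedded newlines in the text are the culprit; strip() alone would only touch the ends, not whitespace in the middle):

# Collapse every run of whitespace (including embedded '\n') to a single
# space, so each record stays on exactly one physical line
clean_text = u' '.join(text.split())
output_file.write(str(label) + '\t' + clean_text.encode('utf-8') + '\n')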