0

Thanks for your help in advance. I am trying to read a JSON file into a pandas DataFrane and getting a cornucopia of unicode/ascii errors. Edit: The error appears to lie in the fact that the JSON file is multi line with each line its own JSON object.

With a data file that looks like:

"data.json" = 

{"_i":{"$o":"5b"},"c_id":"10","p_id":"10","c_c":2,"l_c":59,"u":{"n":"J","id":"1"},"c_t":"2010","m":"Hopefully \n\nEDIT: Actually."}
{"_i":{"$o":"5b"},"p_id":"10","c_id":"10","p_id":"10","c_c":0,"l_c":8,"u":{"n":"S","id":"1"},"c_t":"2010","m":"in-laws?"}

Edit: In response to a comment, the above is not code to be run, it is included as a sample of my datafile, that is saved as a json file.

As this is a multiple line file, per this link Loading a file with more than one line of JSON into Python's Pandas I tried to use

import pandas
df = pandas.read_json('data.json', lines = True)

Gives the error:

    json = u'[' + u','.join(lines) + u']'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 436: ordinal not in range(128)

According to this issue highlighted on GitHub https://github.com/pandas-dev/pandas/issues/15132, this is because:

This can happen in Python 2.7 if the default encoding is set to ascii (check sys.getdefaultencoding()). StringIO will convert the input string to ascii when lines=True, resulting in a UnicodeDecodeError because of mixing utf-8 and ascii strings.

Their solution is to change the system encoding to utf-8 from ascii, however, I understand that this is inadvisable - source:Changing default encoding of Python?.

I also tried changing the encoding both to utf-8 / ascii within read_json() but to no avail.

How can I successfully read this json file into a pandas DataFrame, preserving the multi-line structure?

Many thanks!

Community
  • 1
  • 1
Chuck
  • 3,664
  • 7
  • 42
  • 76
  • Without an [Minimal, Complete, and Verifiable](http://stackoverflow.com/help/mcve) example this will be hard for someone to help you. – Stephen Rauch Mar 14 '17 at 06:49
  • @StephenRauch Thanks for your comment. I thought I had provided as such - I provided the source code, data file, what outcome I expect, and the outcome I am getting. What else would you like to see? (That is a genuine question, not said sarcastically) – Chuck Mar 14 '17 at 09:34
  • Fundamental problem is that there is nothing here that can be cut and paste to be run. The very first code block is a syntax error. Many will read no further than that. – Stephen Rauch Mar 14 '17 at 13:16
  • Thanks for your comment. So the first code block provides a sample of the data - I thought this was obvious with `"With a data file that looks like:"`Apparently not. Clearly, one is not able to copy and past this as a Json file. It was instead provided so that, if someone wanted to test my code, they would be able to save this data and reproduce my error. Apparently I didn't make this clear enough and should have been more verbose - I have edited this information into the question. Thanks for your suggestion. – Chuck Mar 14 '17 at 13:51
  • It is not the lack of information that is the problem. It is the difficulty of taking that information and trying it out to see if someone can help you. Maybe this will help with understanding the issue: https://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/ – Stephen Rauch Mar 14 '17 at 13:58
  • Ah OK. I fully understand that actually - particularly in this case, as I cannot make my data file publicly available, it does therefore add an extra layer of difficulty for an answerer. I did not expect much, instead I was hoping someone who has encountered a similar error (and potentially a workaround for the bug highlighted) may have something to input. I appreciate your feedback. – Chuck Mar 14 '17 at 14:02

2 Answers2

2

People are so cranky on here sometimes. Ok so in python 2.7, it defaults to ascii and you can use the following line to see that:

encoding = sys.getdefaultencoding()
print encoding

It appears that they made a fix for this in pandas by allowing you to set the encoding like:

pd.read_json(the_file, encoding = encoding)

Unfortunately, that line doesn't seem to work either.

So rather than depend on pandas we can do it ourselves as well. All the "lines" option does is tack on square brackets at the end and join by commas ( i.e. [{},{},{}] ).

First, read in the data and strip it:

with open(path+theFile, 'rb') as f:
    data = f.readlines()

data = map(lambda x: x.rstrip(), data)

Python read in the lines no problem with encoding. We can then use the same code from pandas to do the lines:

data_lines = "[" + ','.join(data) + "]"

Then read the lines into the parser like normal:

df = pd.read_json(data_lines)

BTW, none of this is an issue in python 3

user3023715
  • 1,539
  • 2
  • 11
  • 12
  • Apologies for the delay on this response. Although I solved this issue at the time (see my posted answer) I have gone through and checked your solution against mine. It is most excellent - whereas mine saves everything as unicode, your answer has the effect of preserving the original data types. This is important as I had a lot of `datetime` `timestamp` information. While your answer takes marginally longer (0.45s vs 0.35s), I believe this does not outway the data type benefit. Basically your answer works great! Thank you very much for taking the time to write it - I really appreciate it! :) – Chuck May 02 '17 at 08:27
0

Eventually, I ended up using the answer here https://stackoverflow.com/a/34463368/2254228 making use of json.loads() to save the output of the file line by line.

My code thus became:

import pandas
import json

data=[]
with open('data.json') as f:
    for line in f:
        data.append(json.loads(line))
df = pd.DataFrame(data)
Community
  • 1
  • 1
Chuck
  • 3,664
  • 7
  • 42
  • 76