Thanks for your help in advance. I am trying to read a JSON file into a pandas DataFrame and getting a cornucopia of unicode/ascii errors. Edit: The error appears to stem from the fact that the JSON file is multi-line, with each line being its own JSON object.
With a data file that looks like:
"data.json" =
{"_i":{"$o":"5b"},"c_id":"10","p_id":"10","c_c":2,"l_c":59,"u":{"n":"J","id":"1"},"c_t":"2010","m":"Hopefully \n\nEDIT: Actually."}
{"_i":{"$o":"5b"},"p_id":"10","c_id":"10","p_id":"10","c_c":0,"l_c":8,"u":{"n":"S","id":"1"},"c_t":"2010","m":"in-laws?"}
Edit: In response to a comment: the above is not code to be run. It is a sample of my data file, which is saved as a JSON file.
As this is a multi-line file, I followed the linked question "Loading a file with more than one line of JSON into Python's Pandas" and tried:
import pandas
df = pandas.read_json('data.json', lines=True)
This gives the error:
json = u'[' + u','.join(lines) + u']'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 436: ordinal not in range(128)
According to this issue highlighted on GitHub https://github.com/pandas-dev/pandas/issues/15132, this is because:
This can happen in Python 2.7 if the default encoding is set to ascii (check sys.getdefaultencoding()). StringIO will convert the input string to ascii when lines=True, resulting in a UnicodeDecodeError because of mixing utf-8 and ascii strings.
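To make the quoted explanation concrete, here is a minimal sketch of the failure mode as I understand it. The byte 0xf0 is the lead byte of a 4-byte UTF-8 sequence (an emoji, for example), so decoding it as ascii fails. The sample character is my own choice for illustration and is not taken from my real file.

```python
# Encode a 4-byte UTF-8 character (illustrative emoji, not from the real data).
raw = u"\U0001F600".encode("utf-8")  # the first byte is 0xf0

try:
    # Decoding as ascii fails on any byte >= 0x80.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

On Python 2.7 the same decode can happen implicitly when a byte string is joined with a unicode string, which appears to be what the traceback line above is doing.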
Their solution is to change the system default encoding from ascii to utf-8; however, I understand that this is inadvisable (source: Changing default encoding of Python?).
I also tried setting the encoding to both utf-8 and ascii within read_json(), but to no avail.
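For reference, the kind of workaround I have been considering is sketched below: decode the file explicitly as UTF-8, parse each line with the standard json module, and build the DataFrame from the resulting list of dicts. The helper name read_json_lines is hypothetical, and I am assuming the file really is UTF-8 encoded with one JSON object per line. Nested objects such as "u" stay as dicts in a single column.

```python
import io
import json

import pandas as pd


def read_json_lines(path, encoding="utf-8"):
    """Parse a file with one JSON object per line into a DataFrame.

    Hypothetical helper; assumes the file is encoded as `encoding`.
    """
    records = []
    # io.open decodes explicitly, so the default encoding is never consulted.
    with io.open(path, encoding=encoding) as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return pd.DataFrame(records)
```

Usage would be df = read_json_lines('data.json'). This sidesteps read_json(lines=True) rather than fixing it, so I would still prefer a solution that works through pandas directly.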
How can I successfully read this json file into a pandas DataFrame, preserving the multi-line structure?
Many thanks!