0

I am working with python and I have a file (data.json) which contains multiple jsons but the whole of it is not a json.

So the file looks like that:

{ "_id" : 01, ..., "path" : "2017-12-12" }
{ "_id" : 02, ..., "path" : "2017-1-12" }
{ "_id" : 03, ..., "path" : "2017-5-12" }

at the place of ... there are about 30 more keys which some of them have nested jsons (so my point is that each json above is pretty long).

Therefore, each of the blocks above at this single file are jsons but the whole of the file is not a json since these are not separated by commas etc.

How can I read each of these jsons separately either with pandas or with simple python?

I have tried this:

import pandas as pd
df = pd.read_json('~/Desktop/data.json', lines=True)

and it actually creates a dataframe where each row is about one json but it also create a column for each of the (1st level) keys of the json which makes things a bit more messy instead of putting the whole json directly in one cell.

To be more clear, I would like my output to be like this in a 'pandas' dataframe (or in another sensible data-structure):

    jsons
0   { "_id" : 01, ..., "path" : "2017-12-12" }
1   { "_id" : 02, ..., "path" : "2017-1-12" }
2   { "_id" : 03, ..., "path" : "2017-5-12" }
Outcast
  • 4,967
  • 5
  • 44
  • 99

2 Answers2

1

Idea is use read_csv with no exist separator in data and then convert each value of column to dictionary:

import pandas as pd
import ast, json
from io import StringIO

temp=u"""{ "_id" : 1,  "path" : "2017-12-12" }
{ "_id" : 2,  "path" : "2017-1-12" }
{ "_id" : 3,  "path" : "2017-5-12" }"""
#after testing replace 'pd.compat.StringIO(temp)' to 'filename.csv'
df = pd.read_csv(StringIO(temp), sep="|", names=['data'])

print (df)

#jsons
df['data'] = df['data'].apply(json.loads)
#dictionaries
#df['data'] = df['data'].apply(ast.literal_eval)

print (df)
                               data
0  {'_id': 1, 'path': '2017-12-12'}
1   {'_id': 2, 'path': '2017-1-12'}
2   {'_id': 3, 'path': '2017-5-12'}
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • cool (upvote), what about that by the way `df = pd.read_csv('~/Desktop/data.json', sep="|", header=None)` ? – Outcast Aug 19 '19 at 11:18
  • @PoeteMaudit - Good solution, difference is new column name is `0`, not `data` – jezrael Aug 19 '19 at 11:19
  • ah ok yes I see. and you use `df['data'] = df['data'].apply(ast.literal_eval)` to convert the content of each cell from `pandas series` to `dict`? – Outcast Aug 19 '19 at 11:22
  • @PoeteMaudit - it convert by contenet, but because here are dictioanry in strings, it convert to dicts. If use `.apply(json.loads)` it convert to jsons – jezrael Aug 19 '19 at 11:25
  • ok, may i see you what you mean. However, to start with I get this error at this line: `raise ValueError('malformed node or string: ' + repr(node)) ValueError: malformed node or string: <_ast.Name object at 0x11a76d748>`. How can I fix this? Or should I try sth pretty different? – Outcast Aug 19 '19 at 11:28
  • @PoeteMaudit - It means there is problem with json - not valid :( – jezrael Aug 19 '19 at 11:29
  • Hm not sure. The jsons separately are fine I think. It is just that the whole file is not a json as I said since these jsons are simply put the one after each other without any commas in between (and a bracket at the beginning and at the end of the whole file). So I do not think that the individual jsons have any problem. Are you sure that you are not missing sth at your code? – Outcast Aug 19 '19 at 11:31
  • 1
    Hm for example this works fine without any errors: https://stackoverflow.com/questions/12451431/loading-and-parsing-a-json-file-with-multiple-json-objects-in-python. You may leave your answer though because it may be useful in some cases etc. – Outcast Aug 19 '19 at 11:42
0

As the file is itself is not a json, so i will read it line by line and as the line is a string format so i will convert it to dict type using yaml then last i will append it all in dataframe

import yaml
import pandas as pd
f = open('data.json')
line = f.readline()
df = pd.DataFrame()

while line:

    #string line to dict
    d = yaml.load(line)

    #temp dataframe
    df1=pd.DataFrame(d,index=[0])

    #append in every iteration
    df=df.append(df1, ignore_index=True)
    line = f.readline()

f.close()
print(df)
#output
  _id        path
0  01  2017-12-12
1  02   2017-1-12
2  03   2017-5-12