
My code succeeds when I process a single-line JSON file, but fails when I process two or more lines; the JSON file is around 100k lines.

import json
import pandas as pd

# Read the whole file and parse it as one JSON document
with open("F:\\1.json", "r") as fh:
    file_data = fh.read()
data = json.loads(file_data)
df = pd.json_normalize(data)
df.to_csv("F:\\1.csv", index=False)

My JSON format:

{"_index":"core-bvd-dmc","_type":"_doc","_id":"e22762d5c4b81fbcad62b5c1d77226ec","_score":1,"_source":{"a_id":"P305906272","a_id_type":"Contact ID","a_name":"Mr Chuanzong Chen","a_name_normal":"MR CHUANZONG CHEN","a_job_title":"Executive director and general manager","relationship":"Currently works for (Executive director and general manager)","b_id":"CN9390051924","b_id_type":"BVD ID","b_name":"Yantai haofeng trade co., ltd.","b_name_normal":"YANTAI HAOFENG TRADE CO","b_country_code":"CN","b_country":"China","b_in_compliance_db":false,"b_nationality":"CN","b_street_address":"Bei da jie 53hao 1609shi; Zhi fu qu","b_city":"Yantai","b_postcode":"264000","b_region":"East China|Shandong","b_phone":"+86 18354522200","b_email":"18354522200@163.com","b_latitude":37.511873,"b_longitude":121.396883,"b_geo_accuracy":"Community","b_national_ids":{"Unified social credit code":["91370602073035263P"],"Trade register number":["370602200112047"],"NOC":["073035263"]},"dates":{"date_of_birth":null},"file_name":"/media/hedwig/iforce/data/BvD/s3-transfer/SuperTable_v3_json/dmc/part-00020-7b09c546-2adc-413e-9e68-18b300e205cf-c000.json","b_geo_point":{"lat":37.511873,"lon":121.396883}}}
{"_index":"core-bvd-dmc","_type":"_doc","_id":"97871f8842398794e380a748f5b82ea5","_score":1,"_source":{"a_id":"P305888975","a_id_type":"Contact ID","a_name":"Mr Hengchao Jiang","a_name_normal":"MR HENGCHAO JIANG","a_job_title":"Legal representative","relationship":"Currently works for (Legal representative)","b_id":"CN9390053357","b_id_type":"BVD ID","b_name":"Yantai ji hong educate request information co., ltd.","b_name_normal":"YANTAI JI HONG EDUCATE REQUEST INFORMATION CO","b_country_code":"CN","b_country":"China","b_in_compliance_db":false,"b_nationality":"CN","b_street_address":"Ying chun da jie 131hao nei 1hao; Lai shan qu","b_city":"Yantai","b_postcode":"264000","b_region":"East China|Shandong","b_phone":"+86 18694982900","b_email":"xyw_700@163.com","b_latitude":37.511873,"b_longitude":121.396883,"b_geo_accuracy":"Community","b_national_ids":{"NOC":["597807789"],"Trade register number":["370613200023836"],"Unified social credit code":["913706135978077898"]},"dates":{"date_of_birth":null},"file_name":"/media/hedwig/iforce/data/BvD/s3-transfer/SuperTable_v3_json/dmc/part-00020-7b09c546-2adc-413e-9e68-18b300e205cf-c000.json","b_geo_point":{"lat":37.511873,"lon":121.396883}}}


Converting a single line of the JSON file on its own succeeds, but converting two or more lines raises this error:

 File "C:\Users\jeri\AppData\Local\Programs\Python\Python39\lib\json\decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 1184)

Process finished with exit code 1
jeri teri

1 Answer

import json
import pandas as pd

file_path = "your_file_name"
with open(file_path, 'r') as fh:
    file_data = fh.readlines()

# Each line of the file is a separate JSON document,
# so parse the lines one at a time instead of the whole file at once
all_data = []
for data in file_data:
    data = data.strip()
    if data:  # skip blank lines
        all_data.append(json.loads(data))

df = pd.json_normalize(all_data)
df
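
As an aside, pandas can also parse this newline-delimited (JSON Lines) format directly with read_json and lines=True; a minimal sketch, noting that nested objects such as _source then stay as dict columns until you normalize them:

import pandas as pd

# lines=True tells pandas that each line is a separate JSON document
df = pd.read_json("your_file_name", lines=True)

# Nested objects (e.g. "_source") remain dict columns; flatten them afterwards
df = pd.json_normalize(df.to_dict("records"))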

Load using an iterator

all_data = []
with open('sample.txt') as fh:
    for line in fh:  # iterating the file object streams one line at a time
        data = line.strip()
        if data:
            all_data.append(json.loads(data))
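
If even all_data is too large to hold in memory, you can instead process the file in fixed-size batches, normalizing and appending each batch to the CSV as you go. A minimal sketch, assuming every line is a complete JSON document; the batch size and file names are placeholders:

import json
import pandas as pd

BATCH_SIZE = 10_000  # placeholder; tune to available memory

batch = []
header_written = False
with open('sample.txt') as fh, open('sample.csv', 'w', newline='') as out:
    for line in fh:
        line = line.strip()
        if line:
            batch.append(json.loads(line))
        if len(batch) == BATCH_SIZE:
            # Write the CSV header only with the first batch
            pd.json_normalize(batch).to_csv(out, index=False, header=not header_written)
            header_written = True
            batch = []
    if batch:  # flush the final partial batch
        pd.json_normalize(batch).to_csv(out, index=False, header=not header_written)

Note that if different lines contain different keys, the columns of later batches may not line up with the header written by the first one.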
Mazhar
  • My JSON file has 100 thousand lines, about 0.5 GB; your modification is not applicable – jeri teri Feb 03 '22 at 07:00
  • Previously I had just used a string variable as a reference; you can of course replace it with your own file-read operation. – Mazhar Feb 03 '22 at 07:27
  • Yes, your updated code helped me – jeri teri Feb 03 '22 at 07:46
  • When reading a large file it raises a memory error. If I want to read line by line or in blocks, how should I modify it? – jeri teri Feb 03 '22 at 08:00
  • But again, you are accumulating all the data in a single variable ```all_data```, which can cause a memory error too. Have a look at the updated portion. One solution is to read by chunk size, but a chunk can end anywhere, which may cut a single JSON document in half: https://stackoverflow.com/questions/8009882/how-to-read-a-large-file-line-by-line – Mazhar Feb 03 '22 at 08:13
  • So the data can only be loaded in full, and no iterator can be used? – jeri teri Feb 03 '22 at 09:32