
I need to convert a JSON file into a CSV file. The JSON file is very large, so I use chunked reading. The console shows no errors, but the CSV file is blank; nothing is written:

import pandas as pd

file_data = pd.read_json('G:\\1.json',
                         encoding='utf8',lines=True,chunksize=100000)

for subdf in file_data:
 df = pd.json_normalize(subdf)
 df.to_csv('F:\\1.csv',encoding='utf-8', mode='a',index=False)

My JSON file:

{"_index":"core-bvd-dmc","_type":"_doc","_id":"e22762d5c4b81fbcad62b5c1d77226ec","_score":1,"_source":{"a_id":"P305906272","a_id_type":"Contact ID","a_name":"Mr Chuanzong Chen","a_name_normal":"MR CHUANZONG CHEN","a_job_title":"Executive director and general manager","relationship":"Currently works for (Executive director and general manager)","b_id":"CN9390051924","b_id_type":"BVD ID","b_name":"Yantai haofeng trade co., ltd.","b_name_normal":"YANTAI HAOFENG TRADE CO","b_country_code":"CN","b_country":"China","b_in_compliance_db":false,"b_nationality":"CN","b_street_address":"Bei da jie 53hao 1609shi; Zhi fu qu","b_city":"Yantai","b_postcode":"264000","b_region":"East China|Shandong","b_phone":"+86 18354522200","b_email":"18354522200@163.com","b_latitude":37.511873,"b_longitude":121.396883,"b_geo_accuracy":"Community","b_national_ids":{"Unified social credit code":["91370602073035263P"],"Trade register number":["370602200112047"],"NOC":["073035263"]},"dates":{"date_of_birth":null},"file_name":"/media/hedwig/iforce/data/BvD/s3-transfer/SuperTable_v3_json/dmc/part-00020-7b09c546-2adc-413e-9e68-18b300e205cf-c000.json","b_geo_point":{"lat":37.511873,"lon":121.396883}}}
{"_index":"core-bvd-dmc","_type":"_doc","_id":"97871f8842398794e380a748f5b82ea5","_score":1,"_source":{"a_id":"P305888975","a_id_type":"Contact ID","a_name":"Mr Hengchao Jiang","a_name_normal":"MR HENGCHAO JIANG","a_job_title":"Legal representative","relationship":"Currently works for (Legal representative)","b_id":"CN9390053357","b_id_type":"BVD ID","b_name":"Yantai ji hong educate request information co., ltd.","b_name_normal":"YANTAI JI HONG EDUCATE REQUEST INFORMATION CO","b_country_code":"CN","b_country":"China","b_in_compliance_db":false,"b_nationality":"CN","b_street_address":"Ying chun da jie 131hao nei 1hao; Lai shan qu","b_city":"Yantai","b_postcode":"264000","b_region":"East China|Shandong","b_phone":"+86 18694982900","b_email":"xyw_700@163.com","b_latitude":37.511873,"b_longitude":121.396883,"b_geo_accuracy":"Community","b_national_ids":{"NOC":["597807789"],"Trade register number":["370613200023836"],"Unified social credit code":["913706135978077898"]},"dates":{"date_of_birth":null},"file_name":"/media/hedwig/iforce/data/BvD/s3-transfer/SuperTable_v3_json/dmc/part-00020-7b09c546-2adc-413e-9e68-18b300e205cf-c000.json","b_geo_point":{"lat":37.511873,"lon":121.396883}}}

jeri teri

1 Answer


I think the big issue is that json_normalize() takes a dict, not a Pandas object. Since your file is line-delimited JSON, you can simply open the file, iterate over its lines, JSON-decode each line, normalize it, and append the result to the CSV:

import json
import pandas as pd

with open('input.jsonl', encoding='utf-8') as json_file:
    write_header = True
    for line in json_file:
        if not line.strip():
            continue  # skip blank lines (e.g. a trailing empty line)

        data = json.loads(line)
        df = pd.json_normalize(data)
        df.to_csv('df.csv', mode='a', index=False, header=write_header)
        write_header = False

Running that against your sample JSON, I get:

_index,_type,_id,_score,_source.a_id,_source.a_id_type,_source.a_name,_source.a_name_normal,_source.a_job_title,_source.relationship,_source.b_id,_source.b_id_type,_source.b_name,_source.b_name_normal,_source.b_country_code,_source.b_country,_source.b_in_compliance_db,_source.b_nationality,_source.b_street_address,_source.b_city,_source.b_postcode,_source.b_region,_source.b_phone,_source.b_email,_source.b_latitude,_source.b_longitude,_source.b_geo_accuracy,_source.b_national_ids.Unified social credit code,_source.b_national_ids.Trade register number,_source.b_national_ids.NOC,_source.dates.date_of_birth,_source.file_name,_source.b_geo_point.lat,_source.b_geo_point.lon
core-bvd-dmc,_doc,e22762d5c4b81fbcad62b5c1d77226ec,1,P305906272,Contact ID,Mr Chuanzong Chen,MR CHUANZONG CHEN,Executive director and general manager,Currently works for (Executive director and general manager),CN9390051924,BVD ID,"Yantai haofeng trade co., ltd.",YANTAI HAOFENG TRADE CO,CN,China,False,CN,Bei da jie 53hao 1609shi; Zhi fu qu,Yantai,264000,East China|Shandong,+86 18354522200,18354522200@163.com,37.511873,121.396883,Community,['91370602073035263P'],['370602200112047'],['073035263'],,/media/hedwig/iforce/data/BvD/s3-transfer/SuperTable_v3_json/dmc/part-00020-7b09c546-2adc-413e-9e68-18b300e205cf-c000.json,37.511873,121.396883
core-bvd-dmc,_doc,97871f8842398794e380a748f5b82ea5,1,P305888975,Contact ID,Mr Hengchao Jiang,MR HENGCHAO JIANG,Legal representative,Currently works for (Legal representative),CN9390053357,BVD ID,"Yantai ji hong educate request information co., ltd.",YANTAI JI HONG EDUCATE REQUEST INFORMATION CO,CN,China,False,CN,Ying chun da jie 131hao nei 1hao; Lai shan qu,Yantai,264000,East China|Shandong,+86 18694982900,xyw_700@163.com,37.511873,121.396883,Community,['597807789'],['370613200023836'],['913706135978077898'],,/media/hedwig/iforce/data/BvD/s3-transfer/SuperTable_v3_json/dmc/part-00020-7b09c546-2adc-413e-9e68-18b300e205cf-c000.json,37.511873,121.396883
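Note that list-valued fields like b_national_ids come out as Python list literals (e.g. ['073035263']) in the CSV. If you'd rather have plain strings, one option (a sketch using a made-up one-line record, with '|' as an arbitrary separator) is to join list cells before writing:

```python
import json
import pandas as pd

# a made-up record with a nested list, standing in for one JSONL line
line = '{"ids": {"NOC": ["073035263", "12345"]}, "name": "x"}'
df = pd.json_normalize(json.loads(line))

# join list cells into a single delimited string so the CSV holds
# plain text instead of Python list literals
for col in df.columns:
    df[col] = df[col].map(lambda v: '|'.join(v) if isinstance(v, list) else v)

print(df.to_csv(index=False))
```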

Also, if you do want to get json_normalize() working with read_json(), check out the docs on line-delimited JSON (my emphasis):

For line-delimited json files, pandas can also return an iterator which reads in chunksize lines at a time. This can be useful for large files or to read from a stream.

So your initializer, with chunksize=100000, tells Pandas to read 100,000 lines at a time, which works against your intent of keeping memory usage small.
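If you do want to keep your chunked read_json() approach, the missing step is converting each chunk's rows back into dicts before normalizing them. A minimal sketch (the tiny sample file and paths here are stand-ins for your real ones):

```python
import pandas as pd

# tiny stand-in for the large line-delimited input file
with open('sample.jsonl', 'w') as f:
    f.write('{"_id": "a", "_source": {"x": 1}}\n')
    f.write('{"_id": "b", "_source": {"x": 2}}\n')

# each chunk arrives as a DataFrame, so turn its rows back into dicts
# before handing them to json_normalize()
reader = pd.read_json('sample.jsonl', lines=True, chunksize=1)

write_header = True
for chunk in reader:
    records = chunk.to_dict(orient='records')   # list of plain dicts
    df = pd.json_normalize(records)             # flattens nested _source
    df.to_csv('out.csv', mode='a', index=False, header=write_header)
    write_header = False
```

Writing the header only for the first chunk keeps it from being repeated on every append.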

Zach Young
  • Hey, looks like this answer helped, and you're using it in this question over here, https://stackoverflow.com/questions/70985930/python-json-to-csv-csv-repeat-loop-write. Will you please accept this answer? – Zach Young Feb 04 '22 at 19:20