
I have a large JSON file, tens of GB. I read the file line by line, but it is slow: it writes only about 10 MB per minute. Help me modify the code.

This is my code:



import pandas as pd
import json

add_header = True

with open('1.json') as f_json:
    for line in f_json:
        line = line.strip()
        
        # Normalize one JSON record at a time and append it to the CSV.
        df = pd.json_normalize(json.loads(line))
        df.to_csv('1.csv', index=None, mode='a', header=add_header)
        add_header = False

I also tried chunked reading, but got an error. The code:

import pandas as pd
import json


data = pd.read_json('G:\\1.json',
                    encoding='utf8', lines=True, chunksize=100000)
for df in data:
    line = df.strip()
    df = pd.json_normalize(json.loads(line))
    df.to_csv('G:\\1.csv', index=None, mode='a', encoding='utf8')

Output:

    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'strip'
Process finished with exit code -1

Here is my JSON file:

{"_index":"core-bvd-dmc","_type":"_doc","_id":"e22762d5c4b81fbcad62b5c1d77226ec","_score":1,"_source":{"a_id":"P305906272","a_id_type":"Contact ID","a_name":"Mr Chuanzong Chen","a_name_normal":"MR CHUANZONG CHEN","a_job_title":"Executive director and general manager","relationship":"Currently works for (Executive director and general manager)","b_id":"CN9390051924","b_id_type":"BVD ID","b_name":"Yantai haofeng trade co., ltd.","b_name_normal":"YANTAI HAOFENG TRADE CO","b_country_code":"CN","b_country":"China","b_in_compliance_db":false,"b_nationality":"CN","b_street_address":"Bei da jie 53hao 1609shi; Zhi fu qu","b_city":"Yantai","b_postcode":"264000","b_region":"East China|Shandong","b_phone":"+86 18354522200","b_email":"18354522200@163.com","b_latitude":37.511873,"b_longitude":121.396883,"b_geo_accuracy":"Community","b_national_ids":{"Unified social credit code":["91370602073035263P"],"Trade register number":["370602200112047"],"NOC":["073035263"]},"dates":{"date_of_birth":null},"file_name":"/media/hedwig/iforce/data/BvD/s3-transfer/SuperTable_v3_json/dmc/part-00020-7b09c546-2adc-413e-9e68-18b300e205cf-c000.json","b_geo_point":{"lat":37.511873,"lon":121.396883}}}
{"_index":"core-bvd-dmc","_type":"_doc","_id":"97871f8842398794e380a748f5b82ea5","_score":1,"_source":{"a_id":"P305888975","a_id_type":"Contact ID","a_name":"Mr Hengchao Jiang","a_name_normal":"MR HENGCHAO JIANG","a_job_title":"Legal representative","relationship":"Currently works for (Legal representative)","b_id":"CN9390053357","b_id_type":"BVD ID","b_name":"Yantai ji hong educate request information co., ltd.","b_name_normal":"YANTAI JI HONG EDUCATE REQUEST INFORMATION CO","b_country_code":"CN","b_country":"China","b_in_compliance_db":false,"b_nationality":"CN","b_street_address":"Ying chun da jie 131hao nei 1hao; Lai shan qu","b_city":"Yantai","b_postcode":"264000","b_region":"East China|Shandong","b_phone":"+86 18694982900","b_email":"xyw_700@163.com","b_latitude":37.511873,"b_longitude":121.396883,"b_geo_accuracy":"Community","b_national_ids":{"NOC":["597807789"],"Trade register number":["370613200023836"],"Unified social credit code":["913706135978077898"]},"dates":{"date_of_birth":null},"file_name":"/media/hedwig/iforce/data/BvD/s3-transfer/SuperTable_v3_json/dmc/part-00020-7b09c546-2adc-413e-9e68-18b300e205cf-c000.json","b_geo_point":{"lat":37.511873,"lon":121.396883}}}

Reading in chunks again, this time with chunksize=1:

import pandas as pd

add_header = True
data = pd.read_json('G:\\1.json',
                    encoding='utf8', lines=True, chunksize=1)
for subdf in data:
    # subdf is already a dataframe.
    temp_df = pd.concat([subdf[['_index', '_id', '_score']],
                         pd.json_normalize(subdf._source)], axis=1)
    temp_df.to_csv('G:\\1.csv', index=None, header=add_header, mode='a', encoding='utf8')
    add_header = False

The output: I converted 2 lines of the JSON file. The first line is converted correctly, but the second JSON line comes out wrong: the values for the keys '_index', '_id', '_score' and the values from "_source" end up on separate CSV rows.

_index,_id,_score,a_id,a_id_type,a_name,a_name_normal,a_job_title,relationship,b_id,b_id_type,b_name,b_name_normal,b_country_code,b_country,b_in_compliance_db,b_nationality,b_street_address,b_city,b_postcode,b_region,b_phone,b_email,b_latitude,b_longitude,b_geo_accuracy,file_name,b_national_ids.Unified social credit code,b_national_ids.Trade register number,b_national_ids.NOC,dates.date_of_birth,b_geo_point.lat,b_geo_point.lon
core-bvd-dmc,e22762d5c4b81fbcad62b5c1d77226ec,1,P305906272,Contact ID,Mr Chuanzong Chen,MR CHUANZONG CHEN,Executive director and general manager,Currently works for (Executive director and general manager),CN9390051924,BVD ID,"Yantai haofeng trade co., ltd.",YANTAI HAOFENG TRADE CO,CN,China,False,CN,Bei da jie 53hao 1609shi; Zhi fu qu,Yantai,264000,East China|Shandong,+86 18354522200,18354522200@163.com,37.511873,121.396883,Community,/media/hedwig/iforce/data/BvD/s3-transfer/SuperTable_v3_json/dmc/part-00020-7b09c546-2adc-413e-9e68-18b300e205cf-c000.json,['91370602073035263P'],['370602200112047'],['073035263'],,37.511873,121.396883
,,,P305888975,Contact ID,Mr Hengchao Jiang,MR HENGCHAO JIANG,Legal representative,Currently works for (Legal representative),CN9390053357,BVD ID,"Yantai ji hong educate request information co., ltd.",YANTAI JI HONG EDUCATE REQUEST INFORMATION CO,CN,China,False,CN,Ying chun da jie 131hao nei 1hao; Lai shan qu,Yantai,264000,East China|Shandong,+86 18694982900,xyw_700@163.com,37.511873,121.396883,Community,/media/hedwig/iforce/data/BvD/s3-transfer/SuperTable_v3_json/dmc/part-00020-7b09c546-2adc-413e-9e68-18b300e205cf-c000.json,['597807789'],['370613200023836'],['913706135978077898'],,37.511873,121.396883
core-bvd-dmc,97871f8842398794e380a748f5b82ea5,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


jeri teri
  • You are using `read_csv` on a JSON file. Typo or a mistake? Without looking at your JSON, what about something like: pd_obj = pd.read_json('my_file.json', orient='index'). Then, pd_obj.to_csv('my_file.csv', index=False). Else, please provide a sample of your json. – sitWolf Feb 03 '22 at 14:47
  • You are reading a JSON file with `read_csv`? What does 1 line in your file look like? – Jan Jaap Meijerink Feb 03 '22 at 14:48
  • I updated the JSON file – jeri teri Feb 03 '22 at 15:00
  • hope that the information in that json file you posted is fake and not real information. –  Feb 03 '22 at 15:10

1 Answer


`chunksize` makes `read_json` return an iterator of DataFrames, so you cannot call `strip` or `json.loads` on what it yields.

You would probably need something like:

for subdf in data:
    # subdf is already a dataframe.
    temp_df = pd.concat([
        subdf[['_index', '_id', '_score']].reset_index(drop=True),
        pd.json_normalize(subdf._source)
    ], axis=1)
    temp_df.to_csv(filename, index=None, mode='a', encoding='utf8')

You can modify the `pd.concat` line to flatten/extract the data you want, but I hope you get the idea.
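
For instance, a minimal sketch that keeps only a couple of the `_source` keys. `a_name` and `b_id` are placeholders taken from the sample data; substitute whatever keys you need. The `add_header` flag (from your original code) avoids repeating the header on every append:

add_header = True
for subdf in data:
    # Flatten _source first, then keep only the keys of interest.
    source_df = pd.json_normalize(subdf._source)[['a_name', 'b_id']]
    temp_df = pd.concat([
        subdf[['_index', '_id', '_score']].reset_index(drop=True),
        source_df
    ], axis=1)
    temp_df.to_csv(filename, index=None, mode='a', encoding='utf8',
                   header=add_header)
    add_header = False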

Another thought: although CSV holds large data better than JSON does, would you consider chunking the output CSV into multiple files instead of creating one huge CSV?
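
For example, a rough sketch of one-file-per-chunk output (the part-numbered file names are just an illustration):

import pandas as pd

data = pd.read_json('G:\\1.json', encoding='utf8', lines=True, chunksize=100000)
for i, subdf in enumerate(data):
    temp_df = pd.concat([
        subdf[['_index', '_id', '_score']].reset_index(drop=True),
        pd.json_normalize(subdf._source)
    ], axis=1)
    # One file per chunk, so each part gets its own header and no append mode.
    temp_df.to_csv(f'G:\\1_part{i}.csv', index=None, encoding='utf8')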

Emma
  • Looks like it works, but it turns a 5 million line JSON file into a 9 million line CSV file, 4 million extra lines – jeri teri Feb 04 '22 at 01:51
  • Did you delete the csv file before each run? You have `mode='a'`; that will append contents to the existing file. If this is not the case, I would try reading the csv file to see what the extra rows actually are. – Emma Feb 04 '22 at 02:00
  • yes, I deleted the csv file before each run; I tried several times – jeri teri Feb 04 '22 at 02:23
  • The other thing is that `header` is on by default, so each `to_csv` call appends the header again. Though that does not add up to 4 million: 5 million rows with chunksize = 100k should be only 50 iterations. Could you check the csv to see what is extra? And if you still have trouble, please edit your question to add your current code. – Emma Feb 04 '22 at 04:40
  • `import pandas as pd add_header = True file_data = pd.read_json('G:\\1.json', encoding='utf8',lines=True,chunksize=1) for subdf in file_data : # subdf is already a dataframe. temp_df = pd.concat([subdf[['_index', '_id', '_score']], pd.json_normalize(subdf._source)], axis=1) temp_df.to_csv('G:\\1.csv', index=None, mode='a',encoding='utf8',header=add_header) add_header = False` – jeri teri Feb 04 '22 at 05:04
  • Two lines of the JSON file produce 3 lines of CSV plus a header – jeri teri Feb 04 '22 at 05:08
  • Two lines of the JSON file produce 3 CSV lines plus a header: the first CSV line is converted completely, the second contains only the information from "_source", and the third only '_index', '_id', '_score' – jeri teri Feb 04 '22 at 05:16
  • I updated the topic, you can see the details – jeri teri Feb 04 '22 at 06:05
  • I updated the answer. I am sorry I forgot about the index. ref: https://stackoverflow.com/a/55726005/2956135 – Emma Feb 04 '22 at 15:43
  • yes it is working. How can I modify the code if I want to get only a specified key? – jeri teri Feb 05 '22 at 14:13
  • `temp_df.to_csv(filename, index=None, columns={"a_name"}, mode='a', encoding='utf8')` will this work? (see the sketch after this thread) – jeri teri Feb 05 '22 at 15:08
  • Did it work as expected? – Emma Feb 07 '22 at 15:19
  • 1
    yes it works fine – jeri teri Feb 07 '22 at 15:32
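
Regarding the `columns=` follow-up in the comments above: `to_csv` does accept a `columns` argument that limits which columns are written. A list is safer than a set here, since a set has no guaranteed order. A minimal sketch, with `_id` and `a_name` standing in for whichever keys you want:

import pandas as pd

add_header = True
data = pd.read_json('G:\\1.json', encoding='utf8', lines=True, chunksize=100000)
for subdf in data:
    temp_df = pd.concat([
        subdf[['_index', '_id', '_score']].reset_index(drop=True),
        pd.json_normalize(subdf._source)
    ], axis=1)
    # columns= keeps only the listed columns in the output file.
    temp_df.to_csv('G:\\1.csv', index=None, columns=['_id', 'a_name'],
                   mode='a', encoding='utf8', header=add_header)
    add_header = False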