
Here is the program I am developing in Python:

Step 1 - We will get a JSON file (the size could be in GBs, e.g. 50 GB or more) from the source to our server.

Step 2 - I use a Pandas DataFrame to load the JSON into a DF using `df = pd.read_json(jsonfile)`

Step 3 - I use `df.to_csv(temp_csvfile, ...)`

Step 4 - I use Python psycopg2 to make the PostgreSQL connection and cursor:

    curr = conn.cursor()

Step 5 - Read the CSV and load it using copy_from:

    with open(temp_csvfile, 'r') as f:
        curr.copy_from(f, ...)
    conn.commit()

I seek feedback on the points below:

a. Will this way of loading the JSON into a Pandas DataFrame not cause an out-of-memory issue if my system memory is smaller than the size of the JSON file?

b. At step 5 I am again opening the file in read mode; will the same issue come up here, since it might load the file into memory (am I missing anything here)?

c. Is there any better way of doing this?

d. Can Python Dask be used, since it provides reading data in chunks (I am not familiar with it)?

Please advise

CIGEEK
  • You are right to doubt this approach; it will certainly lead to 'out of memory' issues. Can you not get the data in multiple JSON files instead of one huge file? The data source system would also be better off generating multiple dataset files. – Anand Sowmithiran Nov 13 '21 at 18:07
  • Thanks Anand, but right now we cannot control the source systems. – CIGEEK Nov 14 '21 at 08:55
  • OK, have you tried it and did it result in any error? Can you share more details? – Anand Sowmithiran Nov 14 '21 at 08:56
  • Also, you can read such large JSON files in chunks; refer to this answer (https://stackoverflow.com/a/67126154/14973743). – Anand Sowmithiran Nov 14 '21 at 09:01
  • Thanks Anand, I am trying the same with the pandas chunksize option. – CIGEEK Nov 14 '21 at 11:28

1 Answer


You could split your input JSON file into many smaller files, and also use the chunksize parameter while reading the file content into a pandas DataFrame. Also, use the psycopg2 copy_from function, which supports a buffer size parameter. In fact, you could use execute_batch() to insert batches of rows into your PostgreSQL table, as in the article mentioned in the references below.
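Below is a minimal sketch of the chunked approach, assuming the input is newline-delimited JSON (pandas only supports chunksize together with lines=True), a hypothetical target table my_table whose column order matches the DataFrame, and placeholder connection details:

    import io

    import pandas as pd
    import psycopg2

    # Placeholder connection details -- adjust for your environment.
    conn = psycopg2.connect("dbname=mydb user=myuser password=secret host=localhost")
    curr = conn.cursor()

    CHUNK_ROWS = 100_000  # rows held in memory at a time; tune to available RAM

    # chunksize returns an iterator of DataFrames instead of one huge DataFrame,
    # so only one chunk of the 50 GB file is in memory at any moment.
    for chunk in pd.read_json("input.json", lines=True, chunksize=CHUNK_ROWS):
        # Stream the chunk through an in-memory CSV buffer straight into COPY,
        # avoiding the temporary CSV file on disk entirely.
        buf = io.StringIO()
        chunk.to_csv(buf, index=False, header=False)
        buf.seek(0)
        curr.copy_from(buf, "my_table", sep=",")
        conn.commit()

    curr.close()
    conn.close()

Note that copy_from uses PostgreSQL's text COPY format and does not understand quoted CSV, so if your fields can contain the separator or newlines, either use copy_expert with a COPY ... FROM STDIN WITH CSV statement, or fall back to execute_batch as mentioned above. A sketch of the execute_batch variant (column names are hypothetical):

    from psycopg2.extras import execute_batch

    insert_sql = "INSERT INTO my_table (col_a, col_b) VALUES (%s, %s)"

    for chunk in pd.read_json("input.json", lines=True, chunksize=CHUNK_ROWS):
        # execute_batch groups many INSERTs into each network round trip.
        execute_batch(curr, insert_sql, chunk.itertuples(index=False, name=None), page_size=1000)
        conn.commit()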

References:

  1. Loading a 20 GB JSON file in pandas
  2. Loading DataFrame data into a PostgreSQL table (article)
  3. Reading a large JSON file into pandas
Anand Sowmithiran