
I have a pandas dataframe (created by appending several CSV files) with more than 5 million records. I need to use the data for a machine learning model.

I would like to convert it to JSON format so that the data loads faster every time I open my ML code. The code below runs fine without any error.

However, it takes a long time to execute. It takes about as long as reading the huge CSV file itself. I believe a JSON file with millions of records can be read in a few seconds/minutes. Could anyone suggest how that could be done?

# creating JSON file
result = dfcustdata.to_json('custdata.json', indent=1, orient='records')

# reading the JSON back into a dataframe
dffinalcustdata = pd.read_json('custdata.json')

Important update - I figured out a way to import huge CSVs very fast without converting to JSON. Here is the code (you can tinker with the chunksize). It reads the CSV in chunks and then concatenates them into the final dataframe df:

# read the CSV in chunks of 2000 rows, then concatenate them into one dataframe
tp = pd.read_csv('custdata.csv', iterator=True, chunksize=2000)
df = pd.concat(tp, ignore_index=True)
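
If the goal is simply faster reloads on later runs, a binary format is usually much quicker to read back than either CSV or JSON. A minimal sketch, assuming pickle is acceptable for the data; the .pkl/.parquet file names are illustrative, and to_parquet needs pyarrow or fastparquet installed:

import pandas as pd

# one-time conversion: read the original CSV and persist it in a binary format
df = pd.read_csv('custdata.csv')
df.to_pickle('custdata.pkl')            # pandas-native pickle file
# df.to_parquet('custdata.parquet')     # alternative: Parquet

# later runs: reload without re-parsing the CSV
df = pd.read_pickle('custdata.pkl')
# df = pd.read_parquet('custdata.parquet')
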
– seraphis
  • Are you able to read the file `custdata.json` in the same script as the one where you write to it? That should save a ton of IO waiting. You can just reuse `result` in that case – InsertCheesyLine Jun 02 '22 at 07:23
  • Why do you think `JSON` would be faster than CSV? It's a more complex format and less information-dense, so it should take longer. Pickle or some other binary format might be faster. – mousetail Jun 02 '22 at 07:31
  • Especially since you have the `indent=1` option, which increases the file size even further – mousetail Jun 02 '22 at 07:34
  • If you _don't really need to_ read the whole dataframe, consider [reading a small batch](https://stackoverflow.com/questions/25962114/how-do-i-read-a-large-csv-file-with-pandas) of it. Read more about [batch training](https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/) – AcaNg Jun 02 '22 at 08:04
  • Please see my update. I figured out a way to import a huge CSV (mine has more than 5 million rows) without having to convert to JSON. I have added the code in my update. – seraphis Jun 02 '22 at 10:30
  • Try read_csv(engine="pyarrow"); it gives better performance. – Hardik Gajjar Nov 15 '22 at 13:41
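
A minimal sketch of the engine="pyarrow" suggestion from the last comment, assuming pandas 1.4+ and the pyarrow package are installed (file name taken from the question):

import pandas as pd

# the multithreaded pyarrow parser is typically faster than the default C engine
df = pd.read_csv('custdata.csv', engine='pyarrow')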
