I have a pandas DataFrame (created by appending several CSV files) with more than 5 million records, which I need to use for a machine learning model.
I would like to convert it to JSON so that the data loads faster every time I open my ML code. The code below runs without any errors.
However, it takes a long time to execute, roughly as long as reading the huge CSV file itself. I believe a JSON file with millions of records can be read in a few seconds or minutes. Could anyone suggest how that could be done?
import pandas as pd

# Write the DataFrame to a JSON file
# (to_json returns None when a path is given, so there is nothing to capture)
dfcustdata.to_json('custdata.json', orient='records', indent=1)

# Read it back into a DataFrame
dffinalcustdata = pd.read_json('custdata.json')
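One variant that might be worth noting (I have not benchmarked it on the full dataset): writing line-delimited JSON with orient='records' and lines=True stores one record per line, which lets pandas read the file back in chunks. The file name and chunk size below are just placeholders.

import pandas as pd

# Write one JSON record per line (newline-delimited JSON)
dfcustdata.to_json('custdata.jsonl', orient='records', lines=True)

# Read it back in chunks and concatenate, similar to the chunked CSV approach below
reader = pd.read_json('custdata.jsonl', orient='records', lines=True, chunksize=100_000)
dffinalcustdata = pd.concat(reader, ignore_index=True)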
Important update: I figured out a way to import huge CSVs quickly without converting to JSON. Here is the code (you can tinker with the chunksize). It reads the CSV in chunks and then concatenates them into the final DataFrame df:
# Read the CSV in chunks, then concatenate them into one DataFrame
tp = pd.read_csv('custdata.csv', iterator=True, chunksize=2000)
df = pd.concat(tp, ignore_index=True)
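If it helps anyone, here is a slightly extended sketch of the same chunked read. The usecols and dtype values are placeholder column names and types standing in for whatever fields the model actually needs, and a larger chunksize is usually reasonable for a file this size; restricting columns and fixing dtypes up front can also cut down on parsing and type-inference work.

# Placeholder columns/dtypes: replace with the fields your model actually uses
chunks = pd.read_csv(
    'custdata.csv',
    usecols=['cust_id', 'feature_1', 'feature_2'],
    dtype={'cust_id': 'int64', 'feature_1': 'float64', 'feature_2': 'float64'},
    chunksize=200_000,
)
df = pd.concat(chunks, ignore_index=True)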