I have read numerous threads on similar topics on this forum; however, I believe what I am asking here is not a duplicate question.
I am reading a very large dataset (22 GB) in CSV format, with 350 million rows. I am trying to read the dataset in chunks, based on the solution provided in that link.
My current code is as follows.
import pandas as pd

def Group_ID_Company(chunk_of_dataset):
    # Aggregate purchase totals per (id, company) pair within the chunk
    return chunk_of_dataset.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum()

chunk_size = 9000000
chunk_skip = 1

# First chunk: write with header
transactions_dataset_DF = pd.read_csv('transactions.csv', skiprows=range(1, chunk_skip), nrows=chunk_size)
Group_ID_Company(transactions_dataset_DF.reset_index()).to_csv('Group_ID_Company.csv')

# Remaining chunks: append without header
for i in range(0, 38):
    chunk_skip += chunk_size
    transactions_dataset_DF = pd.read_csv('transactions.csv', skiprows=range(1, chunk_skip), nrows=chunk_size)
    Group_ID_Company(transactions_dataset_DF.reset_index()).to_csv('Group_ID_Company.csv', mode='a', header=False)
There is no issue with the code; it runs fine. However, groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum() only runs on 9000000 rows at a time, which is the declared chunk_size, whereas I need that statement to run over the entire dataset, not chunk by chunk.
The reason is that when the aggregation runs chunk by chunk, only the rows inside the current chunk get summed together, while many rows belonging to the same (id, company) groups are scattered across the rest of the dataset and end up in other chunks, producing separate partial sums.
A possible solution is to run the same code once more on the newly generated "Group_ID_Company.csv": a second pass over that (much smaller) intermediate file would sum() the required columns again and merge the partial results. However, I am thinking there may be another (better) way of achieving this.
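To make that idea concrete, here is a minimal sketch of the second pass, assuming the intermediate file fits in memory and reusing the column names from above (the output file name Group_ID_Company_final.csv is just a placeholder):

import pandas as pd

# Second pass: re-aggregate the partial per-chunk sums written to Group_ID_Company.csv.
# Because sum() is associative, summing the partial sums gives the same totals as a
# single groupby over the entire transactions.csv.
partial_sums = pd.read_csv('Group_ID_Company.csv')
final_result = partial_sums.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum()
final_result.to_csv('Group_ID_Company_final.csv')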