I have read numerous threads on similar topics on this forum; however, I believe what I am asking here is not a duplicate question.
I am reading a very large dataset (22 GB) in CSV format, with 350 million rows. I am trying to read the dataset in chunks, based on the solution provided in that link.
My current code is as follows.
import pandas as pd

def Group_ID_Company(chunk_of_dataset):
    # Aggregate purchase totals per (id, company) pair within the chunk
    return chunk_of_dataset.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum()

chunk_size = 9000000
chunk_skip = 1

# First chunk: write with header
transactions_dataset_DF = pd.read_csv('transactions.csv', skiprows=range(1, chunk_skip), nrows=chunk_size)
Group_ID_Company(transactions_dataset_DF.reset_index()).to_csv('Group_ID_Company.csv')

# Remaining chunks: append without header
for i in range(0, 38):
    chunk_skip += chunk_size
    transactions_dataset_DF = pd.read_csv('transactions.csv', skiprows=range(1, chunk_skip), nrows=chunk_size)
    Group_ID_Company(transactions_dataset_DF.reset_index()).to_csv('Group_ID_Company.csv', mode='a', header=False)
There is no issue with the code; it runs fine. However, groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum() only runs on 9000000 rows at a time, which is the declared chunk_size, whereas I need that statement to run over the entire dataset, not chunk by chunk.
The reason is that when the aggregation runs chunk by chunk, only the rows inside the current chunk get summed together, while many rows belonging to the same (id, company) groups are scattered across the rest of the dataset and end up in other chunks, producing separate partial sums.
A possible solution is to run the same code once more on the newly generated "Group_ID_Company.csv": a second pass over that (much smaller) intermediate file would sum() the required columns again and merge the partial results. However, I am thinking there may be another (better) way of achieving this.
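To make that idea concrete, here is a minimal sketch of the second pass, assuming the intermediate file fits in memory and reusing the column names from above (the output file name Group_ID_Company_final.csv is just a placeholder):

import pandas as pd

# Second pass: re-aggregate the partial per-chunk sums written to Group_ID_Company.csv.
# Because sum() is associative, summing the partial sums gives the same totals as a
# single groupby over the entire transactions.csv.
partial_sums = pd.read_csv('Group_ID_Company.csv')
final_result = partial_sums.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum()
final_result.to_csv('Group_ID_Company_final.csv')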