
I have a data frame with roughly 4-6 million rows, which takes quite a lot of memory to load. It would be fine if I could just read and process the data one row at a time, but the problem is that I need to aggregate the data (e.g. a sum or an average).

Maybe I could just assign more memory to my worker, but I don't know how large the data will grow in the future.

So my first thought is to pull the data from the database in chunks, aggregate each chunk, and then combine the results:

# Pseudo code: aggregate each chunk separately, then combine the partial results
aggregate_results = []
for i in range(number_of_chunks):
    data = preprocess(get_data(i))             # fetch one chunk and preprocess it
    aggregate_results.append(aggregate(data))  # partial aggregate for this chunk
final_result = combine(aggregate_results)      # e.g. sum of sums, or a weighted average
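
Concretely, I imagine something like the sketch below, using pandas' chunked reading. It's untested against Redshift; the connection string, table, and column names are placeholders, and preprocess is my existing pandas function:

import pandas as pd
import sqlalchemy

# Hypothetical connection string and query, for illustration only
engine = sqlalchemy.create_engine(
    "postgresql+psycopg2://user:pass@my-cluster.redshift.amazonaws.com:5439/mydb"
)
query = "SELECT category, value FROM my_table"

# Running totals so the average can be combined across chunks
total = 0.0
count = 0

# With chunksize set, read_sql returns an iterator of DataFrames,
# so only one chunk is held in memory at a time
with engine.connect().execution_options(stream_results=True) as conn:
    for chunk in pd.read_sql(query, conn, chunksize=100_000):
        chunk = preprocess(chunk)          # my existing pandas preprocessing
        total += chunk["value"].sum()
        count += len(chunk)

final_average = total / count

As far as I understand, chunksize only limits what pandas materializes at once, while stream_results=True asks the driver for a server-side cursor so the whole result set isn't fetched to the client up front.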

Then I thought there might already be an existing solution for this kind of work.

Are there any other options I could take? I'm using AWS Redshift to store the data and Apache Airflow to schedule the tasks, but I don't have any experience with big data solutions like Spark or Hadoop.

I used to aggregate the data with SQL, but now I'm using Pandas because I need to preprocess the data before the aggregation. So just using the database directly is not an option.

Any help would be appreciated.

Tura
  • If the input were a CSV file, you could use the chunksize parameter of Pandas' pd.read_csv, as illustrated in [How do I read a large csv file with pandas](https://stackoverflow.com/questions/25962114/how-do-i-read-a-large-csv-file-with-pandas) – DarrylG Jul 01 '22 at 09:26
  • Why can't you use Redshift to perform the aggregation? It's what it is good at. Then your data frame can just run on the reduced set. – Bill Weiner Jul 01 '22 at 17:39
  • @BillWeiner As I stated in the original post, I need to manipulate the input data before doing the aggregation. Most of the other tasks run on Redshift directly, but this one cannot. – Tura Jul 03 '22 at 10:57
  • I'm trying to use dask, as they say [it's useful when the dataset doesn't fit in memory](https://docs.dask.org/en/stable/dataframe.html#common-uses-and-anti-uses) – Tura Jul 03 '22 at 11:01
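
For reference, the Dask approach from that last comment would look roughly like the sketch below, assuming a SQLAlchemy-style connection string, an integer id column to partition the table on, and the same preprocess function as above. All names are placeholders and this isn't tested against Redshift:

import dask.dataframe as dd

# Hypothetical connection string and table name, for illustration only
con = "postgresql+psycopg2://user:pass@my-cluster.redshift.amazonaws.com:5439/mydb"

# Dask reads the table as many small partitions keyed on `id`,
# so the full 4-6M rows never have to fit in memory at once
ddf = dd.read_sql_table("my_table", con, index_col="id", npartitions=20)

# Run the existing pandas preprocessing on each partition lazily ...
ddf = ddf.map_partitions(preprocess)

# ... and let Dask combine the per-partition aggregates itself
result = ddf["value"].mean().compute()

Depending on what preprocess does, map_partitions may need an explicit meta argument so Dask knows the output columns.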

0 Answers