I have a data frame with nearly 4-6M rows, which takes quite a lot of memory to load. It would be fine if I could just read and process the data row by row, but the problem is that I need to aggregate it (e.g. a sum or an average).
Maybe I could just assign more memory to my worker, but I don't know how large the data will grow in the future.
So my first thought was to pull the data from the database in chunks, aggregate each chunk, and then combine the partial results:
# Pseudo code
aggregate_results = []
for i in range(number_of_chunks):
    data = preprocess(get_data(i))
    aggregate_results.append(aggregate(data))
final_result = combine(aggregate_results)
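For example, if the aggregate is an average, I think each chunk's result would need to carry a count so the partial results can be combined correctly. Here is a rough sketch of what I have in mind, assuming pandas.read_sql with chunksize over a SQLAlchemy connection to Redshift; the connection string, table and column names are made up, and preprocess is just a placeholder for my actual preprocessing step:

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; the real one would point at my Redshift cluster.
engine = create_engine("postgresql+psycopg2://user:password@redshift-host:5439/dbname")

def preprocess(df):
    # Placeholder for my actual preprocessing logic.
    return df

total_sum = 0.0
total_count = 0

# read_sql with chunksize returns an iterator of DataFrames,
# so only one chunk is held in memory at a time.
for chunk in pd.read_sql("SELECT value FROM my_table", engine, chunksize=100_000):
    chunk = preprocess(chunk)
    total_sum += chunk["value"].sum()
    total_count += len(chunk)

# Averages can't be averaged directly, so combine via total sum / total count.
final_average = total_sum / total_count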
Then I thought there might already be an existing solution for this kind of work.
Are there any other options I could take? I'm using AWS Redshift to store the data and Apache Airflow to schedule the tasks, but I don't have any experience with big data solutions like Spark or Hadoop.
I used to aggregate the data with SQL, but now I'm using Pandas because I need to preprocess the data before the aggregation, so just using the database directly is not an option.
Any help would be appreciated.