I have two datasets: one is around 45GB and contains daily transactions for one year, and the second is 3.6GB and contains customer IDs and details. I want to merge the two on a common column to create a single dataset, which exceeds the memory of the server since there can be multiple transactions per customer. I am working on a Windows server with 16 cores and 64GB RAM, which I understand is very limited for this type of work.
Methodology
1. Read the big dataset into a Dask dataframe and set the index to the customer ID.
2. Read the 3.6GB dataset into a pandas dataframe and set the index to the customer ID.
3. Launch a local cluster with the parameters memory_limit='50GB' and processes=False.
4. Merge the Dask dataframe with the pandas dataframe on the index (left_index=True, right_index=True).

A sketch of this setup is shown below.
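For reference, here is a minimal sketch of that setup; the file names, the customer_id column name, the how='left' choice, and the final to_parquet write are assumptions for illustration, not part of the original description:

```python
import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client, LocalCluster

# Local cluster as described above: threaded workers in one process,
# capped at 50GB of memory
cluster = LocalCluster(processes=False, memory_limit='50GB')
client = Client(cluster)

# ~45GB of transactions, read lazily and re-indexed on the customer ID
# ('transactions_*.csv' and 'customer_id' are placeholder names)
transactions = dd.read_csv('transactions_*.csv')
transactions = transactions.set_index('customer_id')

# ~3.6GB of customer details, small enough to fit in a pandas dataframe
customers = pd.read_csv('customers.csv').set_index('customer_id')

# Merge the Dask dataframe with the pandas dataframe on the index
merged = transactions.merge(customers, left_index=True, right_index=True, how='left')

# Write the result partition by partition instead of pulling it into RAM
# (output path is a placeholder; the original workflow does not specify
# how the merged result is materialized)
merged.to_parquet('merged_output/')
```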
This method creates about 75,000 tasks, which eventually blow up the memory.
Is what I am trying to accomplish possible? Have I chosen the wrong tools for the job? I am running out of ideas and desperately need some help.