
I have two datasets: one is around 45GB and contains daily transactions for one year, and the second is 3.6GB and contains customer IDs and details. I want to merge the two on a common column to create a single dataset, which exceeds the memory of the server since there can be multiple transactions per customer. I am working on a Windows server with 16 cores and 64GB RAM, which I understand is a very limited spec for this type of work.

Methodology

  1. Read the big dataset into a Dask dataframe and set the index to be the customer ID.
  2. Read the 3.6GB dataset into a pandas dataframe and set the index to be the customer ID.
  3. Launch a local cluster with the parameters memory_limit='50GB' and processes=False.
  4. Merge the Dask dataframe with the pandas dataframe on index (left_index=True, right_index=True); a code sketch of these steps follows below.
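For reference, a minimal sketch of the setup above. The file paths, CSV format, and the column name customer_id are placeholders, and the result is written to Parquet rather than computed in memory, since the merged dataset exceeds RAM:

    import pandas as pd
    import dask.dataframe as dd
    from dask.distributed import Client, LocalCluster

    # Local cluster as described in the question: threads only, 50GB memory limit.
    cluster = LocalCluster(processes=False, memory_limit="50GB")
    client = Client(cluster)

    # Large transactions table (~45GB) as a Dask dataframe, indexed by customer ID.
    transactions = dd.read_csv("transactions_2019/*.csv").set_index("customer_id")

    # Small customer table (~3.6GB) fits in memory as a plain pandas dataframe.
    customers = pd.read_csv("customers.csv").set_index("customer_id")

    # Join on the shared index and stream the result to disk instead of
    # materializing the full merged dataset in memory.
    merged = transactions.merge(customers, left_index=True, right_index=True)
    merged.to_parquet("merged/")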

This method creates 75,000 tasks, which eventually blow up the memory.

Is what I am trying to accomplish possible, or have I chosen the wrong tools for it? I am running out of ideas and desperately need some help.

Dr.Fykos

1 Answer


Yes, what you want to do is possible, but you may need to play with partition sizes a bit. If there is a lot of repetition in your data, then the pandas merge within a single partition may suddenly produce very large results. You can address this by:

  1. Using smaller partitions (maybe)
  2. Reducing the amount of parallelism (perhaps try dask.config.set(scheduler="single-threaded") to see if that helps); a sketch of both adjustments follows below
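A rough illustration of both suggestions. The blocksize, paths, and column name are placeholders, and this snippet uses Dask's default setup rather than the distributed cluster from the question:

    import dask
    import dask.dataframe as dd
    import pandas as pd

    # 1. Smaller partitions: a modest blocksize keeps each partition small
    #    enough that it can grow during the merge without exhausting RAM.
    transactions = dd.read_csv("transactions_2019/*.csv", blocksize="64MB")
    transactions = transactions.set_index("customer_id")

    customers = pd.read_csv("customers.csv").set_index("customer_id")

    merged = transactions.merge(customers, left_index=True, right_index=True)

    # 2. Less parallelism: the single-threaded scheduler runs one task at a
    #    time, so peak memory stays near one partition's working set.
    with dask.config.set(scheduler="single-threaded"):
        merged.to_parquet("merged/")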
MRocklin