
I'm working with disk.frame and it's great so far.

One piece that confuses me is the chunk size. I sense that a small chunk might create too many tasks, and disk.frame might eat up time managing them. On the other hand, a big chunk might be too expensive for the workers, reducing the performance benefit from parallelism.

What pieces of information can we use to make a better guess for chunk size?

Cauder

1 Answer


This is a tough problem and I probably need better tools.

Currently, everything is guesswork. But I have given a presentation on this, and I will try to bring that material into the docs soon.

Ideally, you want

RAM Used = number of workers * RAM usage per chunk

So, if you have 6 workers (ideal for 6 CPU cores), then you would want smaller chunks than someone with 4 workers but the same amount of total RAM.

The difficulty is in estimating "RAM usage per chunk", which differs across operations like merge, sort, and plain vanilla filtering!
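As a rough starting point, you can invert the identity above and plug in a guessed overhead factor for the operation you are running. Here is a minimal sketch in R; the RAM figure, the worker count, and the overhead factor are all illustrative assumptions, not numbers disk.frame measures for you:

    # A minimal back-of-the-envelope sizing sketch (assumptions, not API).
    total_ram_gb <- 16  # total RAM you are willing to let the workers use
    n_workers    <- 6   # e.g. after setup_disk.frame(workers = 6)
    overhead     <- 3   # assumed blow-up factor: a sort or merge may need
                        # several times the raw chunk size in RAM; a simple
                        # filter needs much less

    # Invert: total RAM = workers * (chunk size * overhead)
    chunk_size_gb <- total_ram_gb / (n_workers * overhead)
    chunk_size_gb
    # ~0.89 GB per chunk as a starting point

If you expect heavier operations, raise the overhead factor and accept smaller chunks; for cheap row-wise filtering you can get away with larger ones.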

This is a hard problem to solve in general! So there is no good solution for now.

xiaodai