
I'm working with disk.frame and it's great so far.

One piece that confuses me is the chunk size. I sense that a small chunk might create too many tasks, and disk.frame might eat up time managing them. On the other hand, a big chunk might be too expensive for the workers, reducing the performance benefit from parallelism.

What pieces of information can we use to make a better guess for chunk size?

Cauder

1 Answer


This is a tough problem and I probably need better tools.

Currently, everything is guesswork. But I have given a presentation on this, and I will try to bring that material into the docs soon.

Ideally, you want

RAM Used = number of workers * RAM usage per chunk

So, if you have 6 workers (ideal for 6 CPU cores), then you would want smaller chunks than someone with 4 workers but the same amount of total RAM.

The difficulty is in estimating "RAM usage per chunk", which differs across operations like merge, sort, and plain vanilla filtering!
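As a rough starting point, you can invert the identity above and plug in a guessed overhead factor for the operation you are running. Here is a minimal sketch in R; the RAM figure, the worker count, and the overhead factor are all illustrative assumptions, not numbers disk.frame measures for you:

    # A minimal back-of-the-envelope sizing sketch (assumptions, not API).
    total_ram_gb <- 16  # total RAM you are willing to let the workers use
    n_workers    <- 6   # e.g. after setup_disk.frame(workers = 6)
    overhead     <- 3   # assumed blow-up factor: a sort or merge may need
                        # several times the raw chunk size in RAM; a simple
                        # filter needs much less

    # Invert: total RAM = workers * (chunk size * overhead)
    chunk_size_gb <- total_ram_gb / (n_workers * overhead)
    chunk_size_gb
    # ~0.89 GB per chunk as a starting point

If you expect heavier operations, raise the overhead factor and accept smaller chunks; for cheap row-wise filtering you can get away with larger ones.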

This is a hard problem to solve in general! So there is no good solution for now.

xiaodai