
I am using dask as described in the question "How to parallelize many (fuzzy) string comparisons using apply in Pandas?".

Basically I do some computations (without writing anything to disk) that invoke Pandas and Fuzzywuzzy (which apparently may not release the GIL, if that helps), and I run something like:

import dask
import dask.dataframe as dd
import dask.multiprocessing

dmaster = dd.from_pandas(master, npartitions=4)
dmaster = dmaster.assign(my_value=dmaster.original.apply(lambda x: helper(x, slave), name='my_value'))
dmaster.compute(get=dask.multiprocessing.get)

However, a variant of the code has been running for 10 hours now and is not over yet. I notice in the Windows Task Manager that

  • RAM utilization is pretty low, corresponding to the size of my data
  • CPU usage bounces from 0% up to 5% every 2-3 seconds or so
  • I have about 20 Python processes of around 100 MB each, and one Python process, likely holding the data, that is 30 GB in size (I have a 128 GB machine with an 8-core CPU)

My question is: is that behavior expected? Am I obviously getting some dask options terribly wrong here?

Of course, I understand the specifics depend on what exactly I am doing, but maybe the patterns above already suggest that something is horribly wrong?

Many thanks!!

ℕʘʘḆḽḘ
    How long do you *expect* this process to take? That machine may be totally idle, deadlocked, or just waiting for something. The only thing we can say for sure is that it sure doesn't look like it's actually *doing* anything. – Ian McLaird Jul 01 '16 at 15:51
  • Thanks @IanMcLaird for your input. I really think the computation should be over by now, especially if there is multiprocessing. Essentially, what are the degrees of freedom here? Setting a different value for `npartitions`? – ℕʘʘḆḽḘ Jul 01 '16 at 15:54

1 Answer


Of course, I understand the specifics depend on what exactly I am doing, but maybe the patterns above already suggest that something is horribly wrong?

This is pretty spot on. Identifying performance issues is tricky, especially when parallel computing comes into play. Here are some things that come to mind.

  1. The multiprocessing scheduler has to move data between processes at every step, and that serialization/deserialization cycle can be quite expensive. Using the distributed scheduler would handle this better (see the first sketch after this list).
  2. Your helper function could be doing something odd.
  3. Generally, using apply is best avoided, even in Pandas (see the second sketch after this list for one alternative).
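
For point 1, here is a minimal sketch of what I mean by the distributed scheduler, run on a local cluster of worker processes. It is not your code: the toy helper, the tiny master frame, and the slave string below are placeholders for illustration only.

from dask.distributed import Client, LocalCluster
import pandas as pd
import dask.dataframe as dd

def helper(x, slave):
    # stand-in for the real fuzzy-matching function
    return len(x) + len(slave)

if __name__ == '__main__':
    # local cluster of worker processes; workers stay alive between tasks,
    # so data is not re-serialized on every step
    client = Client(LocalCluster(n_workers=4, threads_per_worker=1))

    master = pd.DataFrame({'original': ['alpha', 'beta', 'gamma', 'delta']})
    slave = 'reference'

    dmaster = dd.from_pandas(master, npartitions=4)
    dmaster = dmaster.assign(
        my_value=dmaster.original.apply(lambda x: helper(x, slave),
                                        meta=('my_value', 'int64')))
    print(dmaster.compute())  # uses the distributed scheduler attached to the client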
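
For point 3, a rough sketch of one alternative to row-wise apply: map_partitions hands each worker a whole pandas partition, so ordinary pandas/fuzzywuzzy code runs once per partition rather than through a Python call per row. The function body and column names here are illustrative assumptions, not your actual logic.

import pandas as pd
import dask.dataframe as dd

def add_my_value(partition, slave):
    # plain pandas on a single partition; replace this body with the real
    # fuzzy-matching logic
    partition = partition.copy()
    partition['my_value'] = partition['original'].str.len() + len(slave)
    return partition

master = pd.DataFrame({'original': ['alpha', 'beta', 'gamma', 'delta']})
slave = 'reference'

dmaster = dd.from_pandas(master, npartitions=4)
result = dmaster.map_partitions(add_my_value, slave,
                                meta={'original': 'object', 'my_value': 'int64'})
print(result.compute())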

Generally, a good way to pin down these problems is to create a minimal, complete, verifiable example that others can reproduce and play with easily. Often, in creating such an example, you find the solution to your problem anyway. If that doesn't happen, you can at least pass the buck to the library maintainer. Until such an example exists, most library maintainers won't spend their time on it; there are almost always too many details specific to the problem at hand to warrant free service.
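
For example, something as small as the following is usually enough to show where the time goes: synthetic data, a sleep in place of fuzzywuzzy, and a timing comparison across schedulers. Everything in it is invented for illustration, and it uses the newer scheduler= keyword rather than the get= keyword from your snippet.

import time
import pandas as pd
import dask.dataframe as dd

def slow_helper(x):
    time.sleep(0.01)          # stand-in for the real fuzzywuzzy work
    return len(x)

if __name__ == '__main__':
    master = pd.DataFrame({'original': ['row%d' % i for i in range(400)]})
    dmaster = dd.from_pandas(master, npartitions=4)
    task = dmaster.original.apply(slow_helper, meta=('my_value', 'int64'))

    for scheduler in ('synchronous', 'threads', 'processes'):
        start = time.time()
        task.compute(scheduler=scheduler)
        print(scheduler, round(time.time() - start, 2))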

MRocklin
  • hi MRocklin, yes you are right, although I think sometimes just describing the symptoms (low CPU, etc.) can say something about the deeper cause. I'll try to come up with an MCVE, if that's possible. Thanks again for your time – ℕʘʘḆḽḘ Jul 02 '16 at 01:12
  • @MRocklin what would be a better design pattern alternative to the usage of `apply`? I'm using a scipy regression on Xarray, and it's hard to find an equivalent approach. – Ricardo Barros Lourenço Mar 15 '23 at 01:22