I am using dask, as in "How to parallelize many (fuzzy) string comparisons using apply in Pandas?". Basically, I do some computations (without writing anything to disk) that invoke Pandas and FuzzyWuzzy (which apparently may not release the GIL, if that helps), and I run something like:
import dask.multiprocessing
import dask.dataframe as dd

dmaster = dd.from_pandas(master, npartitions=4)
dmaster = dmaster.assign(my_value=dmaster.original.apply(lambda x: helper(x, slave), name='my_value'))
dmaster.compute(get=dask.multiprocessing.get)  # force evaluation with the multiprocessing scheduler
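
For reference, helper is roughly of this shape (a simplified sketch only, assuming slave also has an original column of strings; the real function does more work per row):

from fuzzywuzzy import process

def helper(x, slave):
    # best fuzzy-match score of x among the strings in the slave frame
    match, score = process.extractOne(x, slave.original.tolist())
    return score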
However, a variant of the code has now been running for 10 hours and is not done yet. In the Windows Task Manager I notice that:

- RAM utilization is pretty low, corresponding to the size of my data
- CPU usage bounces between 0% and about 5% every 2-3 seconds or so
- there are about 20 Python processes of roughly 100 MB each, plus one Python process, likely the one holding the data, that is 30 GB in size (I have a 128 GB machine with an 8-core CPU)
The question is: is this behavior expected? Am I obviously getting some dask options terribly wrong here? Of course, I understand that the specifics depend on what exactly I am doing, but maybe the patterns above already suggest that something is horribly wrong?
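
In case a self-contained reproduction helps, here is the same pattern end-to-end with made-up toy data (the column name in slave and the sizes are stand-ins; my real frames are far larger):

import pandas as pd
import dask.multiprocessing
import dask.dataframe as dd
from fuzzywuzzy import process

# toy stand-ins for my real frames
master = pd.DataFrame({'original': ['apple inc', 'goggle', 'microsfot']})
slave = pd.DataFrame({'original': ['apple', 'google', 'microsoft']})

def helper(x, slave):
    # best fuzzy-match score of x among the slave strings
    return process.extractOne(x, slave.original.tolist())[1]

dmaster = dd.from_pandas(master, npartitions=4)
dmaster = dmaster.assign(my_value=dmaster.original.apply(lambda x: helper(x, slave), name='my_value'))
print(dmaster.compute(get=dask.multiprocessing.get))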
Many thanks!!