I'm processing a large set of text-based records in CSV format using Dask, which I'm learning in order to work around datasets that are too large to fit in memory, and I'm trying to filter the records within each group down to the ones that best match a complicated set of criteria.
The best approach I've identified so far is to use Dask to group the records into bite-sized chunks and then write the applicable logic in Python:
import pandas as pd

def reduce_frame(partition):
    records = partition.to_dict('records')
    shortlisted_records = []
    # Use Python to locate promising looking records.
    # Some of the criteria can be cythonized; one criterion
    # revolves around whether a record is a parent or child
    # of records in shortlisted_records.
    for record in records:
        for other in shortlisted_records:
            if other['path'].startswith(record['path']) \
                    or record['path'].startswith(other['path']):
                ...  # keep one, possibly both
        ...
    return pd.DataFrame.from_dict(shortlisted_records)

df = df.groupby('key').apply(reduce_frame, meta={...})
In case it matters, the complicated criteria revolve around weeding out promising-looking links on a web page based on link URL, link text, and CSS selectors across the entire group. Say A and B are already in the shortlist and C is a new record: keep all three if each is very promising; otherwise prefer C over A and/or B if C is more promising than either or both; otherwise drop C. The resulting Pandas partition objects above are tiny. (The dataset in its entirety is not, hence my use of Dask.)
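To make the parent/child part concrete, here is a minimal sketch of the shortlist loop, assuming records are dicts with a `path` field and a hypothetical numeric `score` standing in for "promising"; the real URL/text/selector criteria are of course more involved:

```python
VERY_PROMISING = 0.9  # hypothetical threshold standing in for the real criteria

def shortlist(records):
    """Keep the most promising record(s) along each parent/child path chain."""
    shortlisted = []
    for record in records:
        # Records related to this one: parent or child by path prefix.
        related = [other for other in shortlisted
                   if other["path"].startswith(record["path"])
                   or record["path"].startswith(other["path"])]
        if not related:
            shortlisted.append(record)
        elif record["score"] >= VERY_PROMISING and all(
                o["score"] >= VERY_PROMISING for o in related):
            # Everything is very promising: keep them all.
            shortlisted.append(record)
        else:
            # Prefer the new record over any less promising relatives.
            weaker = [o for o in related if o["score"] < record["score"]]
            if weaker:
                for o in weaker:
                    shortlisted.remove(o)
                shortlisted.append(record)
            # Otherwise drop the new record.
    return shortlisted
```

This is only a stand-in for the real criteria, but something shaped like it would slot into `reduce_frame` in place of the elided loop body.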
Seeing how Pandas exposes inherently row- and column-based functionality, I'm struggling to imagine any vectorized approach to this, so I'm exploring writing the logic in plain Python.
Is the above the correct way to proceed, or are there more Dask/Pandas-idiomatic ways, or simply better ways, to approach this type of problem? Ideally one that allows the computation to be parallelized across a cluster, for instance by using Dask.bag or Dask.delayed and/or cytoolz, or something else I might have missed while learning Python?
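Regarding Dask.bag specifically: the same per-group reduction can be expressed as a fold with `Bag.foldby`, which takes a `binop` that folds one record into a running shortlist and a `combine` that merges shortlists built on different partitions. Below is a hedged sketch of those two functions, assuming records are plain dicts with a `path` field and a hypothetical numeric `score` standing in for the real criteria, and simplifying the rule to "keep the most promising record per parent/child chain". The `foldby` call itself is shown only in a comment, since how the CSV lines get parsed into dicts (`parse_line` here) is not specified:

```python
# With dask.bag, the per-group reduction could be expressed roughly as:
#
#     import dask.bag as db
#     bag = db.read_text("records-*.csv").map(parse_line)  # parse_line: str -> dict (hypothetical)
#     result = bag.foldby(key=lambda r: r["key"],
#                         binop=add_to_shortlist,
#                         initial=[],
#                         combine=merge_shortlists,
#                         combine_initial=[])
#
# The fold functions themselves are plain Python and easy to test in isolation:

def related(a, b):
    """True if one record's path is a prefix of the other's (parent/child)."""
    return a["path"].startswith(b["path"]) or b["path"].startswith(a["path"])

def add_to_shortlist(shortlist, record):
    """binop: fold one record into the running shortlist for its group.
    Simplified rule: keep only the most promising record per path chain."""
    kept = [r for r in shortlist
            if not (related(r, record) and r["score"] < record["score"])]
    if not any(related(r, record) for r in kept):
        kept.append(record)
    return kept

def merge_shortlists(left, right):
    """combine: merge partial shortlists built on different partitions."""
    merged = list(left)
    for record in right:
        merged = add_to_shortlist(merged, record)
    return merged
```

One thing to watch with `foldby` is that `combine` must agree with `binop` and tolerate being applied in any order, since Dask is free to merge partial results as partitions complete.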