
I have a dataframe with 1 million rows and a single function (which I can't vectorize) to apply to each row. I looked into swifter, which promises to leverage multiple processes to speed up computations. On an 8-core machine, however, that's not what I see.

Any idea why?

# assuming Feature and Point come from the geojson package
from geojson import Feature, Point

def parse_row(n_print=None):
    def f(row):
        # optionally print progress every n_print rows
        if n_print is not None and row.name % n_print == 0:
            print(row.name, end="\r")
        # build a GeoJSON Feature from the row's coordinates and values
        return Feature(
            geometry=Point((float(row["longitude"]), float(row["latitude"]))),
            properties={
                "water_level": float(row["water_level"]),
                "return_period": float(row["return_period"])
            }
        )
    return f

In [12]: df["feature"] = df.swifter.apply(parse_row(), axis=1)
Dask Apply: 100%|████████████████████████████████████████| 48/48 [01:19<00:00,  1.65s/it]

In [13]: t = time(); df["feature"] = df.apply(parse_row(), axis=1); print(int(time() - t))
46
ted
It looks like the speed depends on the size of the rows. `df.swifter.apply(lambda x: 1 if x > 5 else 0)` is slower than a simple apply when size(df) < 10**8. Try `pandarallel`, which seems to work nicely: https://github.com/nalepae/pandarallel – notilas Aug 05 '20 at 21:25
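
For reference, a minimal sketch of the pandarallel suggestion from the comment above, assuming the package is installed and parse_row is defined as in the question (parallel_apply is pandarallel's drop-in replacement for apply):

from pandarallel import pandarallel

# start one worker per core; progress_bar mirrors swifter's progress output
pandarallel.initialize(nb_workers=8, progress_bar=True)

df["feature"] = df.parallel_apply(parse_row(), axis=1)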

1 Answer


It mainly depends on the processing power involved and on whether vectorization / parallel processing / optimization can improve the problem at all; sometimes it simply isn't a solution. Also remember that swifter spends time estimating how long the work will take, so a plain df.apply can sometimes be faster simply because it skips that estimate, and the attempted optimization may not have helped either.
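
As a rough illustration of that overhead trade-off (not part of the original answer), here is a sketch that bypasses swifter and parallelises the same work with the standard library only. Chunking the frame keeps the per-task overhead small, but whether this beats a plain df.apply still depends on how expensive parse_row really is. It assumes a fork-based start method (Linux), so the interactively defined parse_row is visible in the workers:

import numpy as np
import pandas as pd
from multiprocessing import Pool

def apply_chunk(chunk):
    # runs in a worker process: a plain pandas apply on one slice of the frame
    return chunk.apply(parse_row(), axis=1)

def parallel_apply(df, n_workers=8):
    # split the frame into one chunk per worker and process the chunks in parallel
    chunks = np.array_split(df, n_workers)
    with Pool(n_workers) as pool:
        results = pool.map(apply_chunk, chunks)
    # each partial result keeps its original index, so concat restores alignment
    return pd.concat(results)

# df["feature"] = parallel_apply(df)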

msarafzadeh