
I have observed on several occasions huge variations in the speed of pandas functions performing similar (and seemingly simple) operations. For example, of the following three, the first two are excruciatingly slow on a dataset with a few million rows, while the last runs within seconds (these solutions were taken from "String concatenation of two pandas columns"):

        df["C"] = df[["A", "B"]].agg("/".join, axis = 1)
        df = df.assign(C = df.apply(lambda row: row.A + "/" + row.B, axis = 1))
        df["C"] = df.A + "/" + df.B

This poses a practical problem: code tested on a small data sample may turn out to be extremely inefficient when run on a larger one (potentially by someone else).

Is there a list of slow and fast pandas functions? Or perhaps I do not understand some basic facts about how pandas processes the data?

Roger Vadim
  • Have you seen this answer: https://stackoverflow.com/a/54298586/9209546 ? – jpp Apr 06 '20 at 09:35
  • Yes, `.apply` and using aggregate functions that are plain Python functions will always be slow. Pandas has provided fast implementations of Python `str` methods available with the `.str` accessor, which work faster than naively passing Python string methods, although never as fast as pure numeric operations, because pandas still uses `object` dtype for strings. – juanpa.arrivillaga Apr 06 '20 at 09:40
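As a concrete illustration of the `.str` accessor mentioned in the comment above (the tiny example DataFrame is made up), `Series.str.cat` performs the same concatenation without a per-row Python function:

```python
import pandas as pd

df = pd.DataFrame({"A": ["x", "y"], "B": ["1", "2"]})

# Vectorized string concatenation via the Series .str accessor
df["C"] = df["A"].str.cat(df["B"], sep="/")
print(df["C"].tolist())  # → ['x/1', 'y/2']
```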
