
Let's say I have a pandas DataFrame `df` with columns A through H, and I want to filter the DataFrame using a function `functn` that takes in a "row" and returns True or False based on whether the row fulfills certain conditions (say the function uses every column except H). Is there a way to efficiently filter this DataFrame without a long and ugly lambda? The solution I have so far looks like this:

df = df[df.apply(functn, axis=1)]

but this method seems to be VERY slow, even for a frame with 15k rows. Is there a clean and efficient way to filter a pandas DataFrame using a user-defined Python function instead of a lambda or `query`?

Note: I previously implemented this using plain Python 2D lists, and it was MUCH faster than using pandas. Am I misusing a feature, or is there a way to make this filtering process faster that I'm not aware of?

edit:

The data is structured roughly like this:

#       A       B       C     D     E     F      G        H      
[
    [string1, string2, int1, int2, int3, int4, float1, float2], 
    ...
]

The function does something like this:

def filter(row):
    var1 = row.G <= 0.01  # G must be at most 0.01
    # span lengths relative to the global_map entries for A and B
    partial_a = ((row.D - row.C + 1) * 1.0) / global_map[row.A]
    partial_b = ((row.F - row.E + 1) * 1.0) / global_map[row.B]
    partial = partial_a >= 0.66 or partial_b >= 0.66  # either ratio is large enough
    return var1 and partial

The non-pandas implementation took the same data as a plain 2D list, looped through each row, applied the function to it (with a list as the argument instead of a "row"), and appended the row to a new list whenever the function returned True.
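For reference, a minimal sketch of what that loop might have looked like (hypothetical names: `rows` for the 2D list, `keep` for the predicate; column positions assumed to follow the A..H layout above):

# Hypothetical column positions matching the A..H layout shown above.
A, B, C, D, E, F, G, H = range(8)

def keep(row):
    # Same predicate as filter() above, but indexing a plain list
    # (global_map is the same dict used in the question).
    var1 = row[G] <= 0.01
    partial_a = ((row[D] - row[C] + 1) * 1.0) / global_map[row[A]]
    partial_b = ((row[F] - row[E] + 1) * 1.0) / global_map[row[B]]
    return var1 and (partial_a >= 0.66 or partial_b >= 0.66)

filtered = [row for row in rows if keep(row)]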

swarajd
  • It'll be easier to help you if you can provide (a) example data, (b) the details of `functn`, and (c) the non-pandas implementation you used. As is, it's hard to know where your bottleneck is. (Benchmarking data would be nice too.) – andrew_reece Dec 14 '17 at 04:20
  • I'll edit the original post to reflect this. Thanks! – swarajd Dec 14 '17 at 04:20
  • @andrew_reece it should be updated! – swarajd Dec 14 '17 at 04:26
  • Looking at your filter() apply function, I noticed that you look up global_map for each row. Is this lookup time-efficient? Otherwise the apply function will compute global_map[row.A] and global_map[row.B] for every row. To speed up processing, you could pre-compute these values before filter() and pass them to the filter() function as arguments. Hope it helps! – Damianos P. Melidis Oct 29 '21 at 10:00
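A sketch of that pre-computation idea, assuming `global_map` is a plain dict (`Series.map` accepts a dict; the column names `size_a` and `size_b` are hypothetical):

# Do each dict lookup once per column instead of once per row.
df['size_a'] = df['A'].map(global_map)
df['size_b'] = df['B'].map(global_map)
# filter() could then read row.size_a and row.size_b instead of hitting the dict.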

1 Answer


IIUC, you don't need a function. Let's use boolean indexing as follows:

# Build each condition as a boolean Series, then combine them.
cond1 = df['G'] <= 0.01
# note: if global_map is a dict, this lookup needs vectorizing (see comments below)
cond2 = ((df.D - df.C + 1) * 1.0) / global_map[df.A] >= 0.66
cond3 = ((df.F - df.E + 1) * 1.0) / global_map[df.B] >= 0.66

mask = cond1 & (cond2 | cond3)

df[mask]
Scott Boston
  • Thanks! As a follow-up, is there a way to generalize this mask so that I can apply it to other dataframes? I think this one is specific, since it references the one `df` I mentioned. – swarajd Dec 14 '17 at 04:53
  • 2
    If `global_map` is a dict, you might need to vectorize the lookup, something like `np.vectorize(global_map.get)(df.A)` – andrew_reece Dec 14 '17 at 04:56
  • You should be able to write that as a function that generates the mask, then apply the mask to the dataframe without looping. – Scott Boston Dec 14 '17 at 04:59
  • @andrew_reece true, that's a good point. Scott, yeah, I'll try that – swarajd Dec 14 '17 at 05:00
  • Yeah, I just tried it. I created a function `genMask(df)` that returns `mask`, then I did `df1[genMask(df1)]`. It should work. – Scott Boston Dec 14 '17 at 05:04
  • @andrew_reece is there a reason the dict has to be vectorized? I noticed I got an error if it wasn't. Is that because the mask is applied to all elements or something like that? – swarajd Dec 14 '17 at 05:13
  • 1
    dicts are basically just lookup tables, they're not used to having a whole bunch of keys crammed into a single lookup. have a look at this post for more options: https://stackoverflow.com/q/16992713/2799941 – andrew_reece Dec 14 '17 at 05:16
  • @swarajd What sort of speed increase did you achieve? I am curious. – Scott Boston Dec 14 '17 at 05:16
  • 1
    @ScottBoston it went from taking 3.894 seconds to 0.376 seconds for the entire python program, so roughly a 10x speedup. – swarajd Dec 14 '17 at 05:22
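Putting the comment suggestions together, a sketch of the generalized version (assumes `global_map` is a plain dict; `genMask` is the helper name from the comments, and `df1` stands for any DataFrame with the same columns):

import numpy as np

def genMask(df):
    # Vectorize the dict lookup so it accepts a whole column at once.
    lookup = np.vectorize(global_map.get)
    cond1 = df['G'] <= 0.01
    cond2 = ((df.D - df.C + 1) * 1.0) / lookup(df.A) >= 0.66
    cond3 = ((df.F - df.E + 1) * 1.0) / lookup(df.B) >= 0.66
    return cond1 & (cond2 | cond3)

df1 = df1[genMask(df1)]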