
I’m trying to figure out if it’s possible to achieve the following without using apply and without using a for loop.

df1 = [df[c].map(lambda v: len(v) > 5) for c in df.columns]

I’m specifically trying to avoid apply and applymap and am looking for a vectorised solution. All values in the DataFrame are strings. I’m using the above as a mask later on.

The fastest I've found is:

df1 = [df[x].map(lambda x: len(x) > 5) for x in df.columns]
df2 = df[pd.concat(df1, axis=1, keys=[s.name for s in df1]).any(axis=1)]

It's faster than:

df[(df.applymap(len) > 5).any(axis=1)]
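
For illustration only (this toy data is not from the original post), here is a self-contained sketch of both masks on a small hypothetical frame of strings; the column names and values are made up:

import pandas as pd

# Hypothetical frame; every value is a string, as in the question.
df = pd.DataFrame({
    "a": ["short", "a longer string", "tiny"],
    "b": ["also short", "x", "no"],
})

# Per-column masks, then combine: keep rows where ANY cell has more than 5 characters.
df1 = [df[c].map(lambda v: len(v) > 5) for c in df.columns]
mask = pd.concat(df1, axis=1, keys=[s.name for s in df1]).any(axis=1)

# The applymap variant from above builds the same row mask.
mask_applymap = (df.applymap(len) > 5).any(axis=1)

print(mask.equals(mask_applymap))  # True
print(df[mask])                    # rows 0 and 1; row 2 has only short strings

Both variants produce one boolean per row, so either can be used directly as the row filter the question describes.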
zerohedge
  • What are you exactly trying to achieve? And can you add some example data for us so we can reproduce an answer? – Erfan Jun 01 '19 at 16:09
  • `df[(df.applymap(len) > 5).any(axis=1)]` is actually not a bad solution. Strings are inherently not vectorizable, so these solutions are all comparable. Another one is `df.apply(lambda x: x.str.len() > 5)`, which applies the comparison column-wise. – cs95 Jun 01 '19 at 16:14
  • @cs95 I’m getting significant speed improvements without applymap and apply, that’s why I asked. – zerohedge Jun 01 '19 at 16:19
  • @cs95 - I've added some examples that I've tested. – zerohedge Jun 01 '19 at 20:41

1 Answer


How about np.vectorize? At least it should be slightly faster than apply. As for the comparison with a for loop, it all depends on your data size and shape. Link, Link

np.vectorize(len)(df.values) > 5
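
Not part of the original answer, but to make the result's shape explicit: `np.vectorize(len)` over `df.values` yields a plain 2-D NumPy boolean array with one entry per cell, not a DataFrame. A minimal sketch, assuming the usual `np`/`pd` imports and a made-up frame:

import numpy as np
import pandas as pd

# Hypothetical frame of strings, for illustration only.
df = pd.DataFrame({"a": ["short", "a longer string"], "b": ["also short", "x"]})

mask2d = np.vectorize(len)(df.values) > 5
print(mask2d)        # element-wise booleans
print(mask2d.shape)  # (2, 2): one entry per cell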
BENY
  • @zerohedge that is as I mentioned, it depends on your real df size :-) – BENY Jun 01 '19 at 20:46
  • `df[np.vectorize(len)(df.values)>5]` is returning the first row for all rows in `df`, how can I use it for my purposes? – zerohedge Jun 02 '19 at 15:55
  • this seems to expand each row to multiple rows (per column); I need an `any` check that only returns the row (and only one per row) for every row where the condition is met. – zerohedge Jun 02 '19 at 16:13
  • @zerohedge you can add np.any():-) – BENY Jun 02 '19 at 16:17
  • where do I put that though? The problem is that it's expanding each row to multiple rows, per column. For now I'm using `df1 = a[np.vectorize(len)(a.values) > 5]; df2 = df1.groupby(df1.index).first()` – zerohedge Jun 02 '19 at 16:20
  • 1
    @zerohedge in your case `np.all(np.vectorize(len)(df.values)>5,1)` – BENY Jun 02 '19 at 16:24
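
To close out the comment thread (this sketch is mine, not from the thread): because the comparison produces a 2-D array, the row filter comes from reducing it along axis 1. `.any(axis=1)` matches the question's original `.any(1)` mask (keep a row if any cell is longer than 5 characters), while `np.all(..., 1)` from the last comment keeps a row only when every cell passes.

import numpy as np
import pandas as pd

# Hypothetical frame of strings, for illustration only.
df = pd.DataFrame({"a": ["short", "a longer string"], "b": ["also short", "x"]})

mask2d = np.vectorize(len)(df.values) > 5   # 2-D element-wise mask

# Keep rows where ANY cell is longer than 5 characters (the question's behaviour).
df_any = df[mask2d.any(axis=1)]

# Keep rows where ALL cells are longer than 5 characters (the last comment's suggestion).
df_all = df[np.all(mask2d, axis=1)]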