Custom boolean filtering in Pandas?

Question

I have a dataframe

                  0         1         2         3       Marketcap
0  1.707280  0.666952  0.638515 -0.061126  2.291747     1.71B
1 -1.017134  1.353627  0.618433  0.008279  0.148128     1.82B
2 -0.774057 -0.165566 -0.083345  0.741598 -0.139851      1.1M
3 -0.630724  0.250737  1.308556 -1.040799  1.064456    30.92M
4  2.029370  0.899612  0.261146  1.474148 -1.663970   476.74k
5  2.029370  0.899612  0.261146  1.474148 -1.663970        -1

Is there some sort of custom filter method, that would let Python know B > M > K?

Say I want to filter, df[df.Marketcap > 35.00M], is there a clever or clean way to do this? Having the M or B makes the value very readable and easy to differentiate.

Thank you.

EDIT: Reopened the thread as Max U's answer while excellent seems to produce a pandas bug, which we opened an issue on Github.

elPastor · Answer 1 · 2017-05-08T03:10:58.450

This isn't super clean, but it does the trick and doesn't use any python iteration:

Code:

# Create a separate column (which you can omit later) that converts 'Marketcap' strings to numbers
df['cap'] = df.loc[df['Marketcap'].str.contains('B'), 'Marketcap'].str.replace('B','').astype(float) * 1000
df['cap'].fillna(df.loc[df['Marketcap'].str.contains('M'), 'Marketcap'].str.replace('M',''), inplace = True)

# For pandas pre-0.20.0 (<May 2017)
print df.ix[df['cap'].astype(float) > 35, :-1]

# For pandas 0.20.0+ (.ix[] deprecated)
print df.iloc[df[df['cap'].astype(float) > 35].index, :-1]

# Or, alternate pandas 0.20.0+ option (thanks @Psidom)
print df[df['cap'].astype(float) > 35].iloc[:,:-1]

Output:

          0         1         2         3         4 Marketcap
0  1.707280  0.666952  0.638515 -0.061126  2.291747     1.71B
1 -1.017134  1.353627  0.618433  0.008279  0.148128     1.82B
4  2.029370  0.899612  0.261146  1.474148 -1.663970    100.9M

Thank you. I just realized that I also have a `k` (1000's) in my dataframe, so I updated the dataframe in OP, to reflect this. Would you be able to update your code to reflect this? Thank you so much. — Moondra, May 09 '17 at 15:26
moondra - @MaxU's solution is much cleaner than mine, I don't see any need to reinvent his wheel. — elPastor, May 10 '17 at 11:17

MaxU - stand with Ukraine · Answer 2 · 2017-05-08T18:36:55.170

2

UPDATE:

In [44]: df
Out[44]:
          0         1         2         3         4 Marketcap
0  1.707280  0.666952  0.638515 -0.061126  2.291747     1.71B
1 -1.017134  1.353627  0.618433  0.008279  0.148128     1.82B
2 -0.774057 -0.165566 -0.083345  0.741598 -0.139851      1.1M
3 -0.630724  0.250737  1.308556 -1.040799  1.064456    30.92M
4  2.029370  0.899612  0.261146  1.474148 -1.663970   476.74k
5  2.029370  0.899612  0.261146  1.474148 -1.663970        -1

In [45]: df[pd.eval(df.Marketcap.replace(['[Kk]','[Mm]','[Bb]'],
                                         ['*10**3','*10**6','*10**9'], regex=True) \
                      .add(' < 35*10**6'))]
Out[45]:
          0         1         2         3         4 Marketcap
2 -0.774057 -0.165566 -0.083345  0.741598 -0.139851      1.1M
3 -0.630724  0.250737  1.308556 -1.040799  1.064456    30.92M
4  2.029370  0.899612  0.261146  1.474148 -1.663970   476.74k
5  2.029370  0.899612  0.261146  1.474148 -1.663970        -1

I'd do it this way:

In [13]: df[pd.eval(df.Marketcap.replace(['M','B'],['','*1000'], regex=True).add(' > 35'))]
Out[13]:
          0         1         2         3         4 Marketcap
0  1.707280  0.666952  0.638515 -0.061126  2.291747     1.71B
1 -1.017134  1.353627  0.618433  0.008279  0.148128     1.82B
4  2.029370  0.899612  0.261146  1.474148 -1.663970    100.9M

Explanation:

In [14]: df.Marketcap.replace(['M','B'],['','*1000'], regex=True)
Out[14]:
0    1.71*1000
1    1.82*1000
2          1.1
3        30.92
4        100.9
Name: Marketcap, dtype: object

In [15]: df.Marketcap.replace(['M','B'],['','*1000'], regex=True).add(' > 35')
Out[15]:
0    1.71*1000 > 35
1    1.82*1000 > 35
2          1.1 > 35
3        30.92 > 35
4        100.9 > 35
Name: Marketcap, dtype: object

In [16]: pd.eval(df.Marketcap.replace(['M','B'],['','*1000'], regex=True).add(' > 35'))
Out[16]: array([True, True, False, False, True], dtype=object)

edited May 08 '17 at 18:36

answered May 08 '17 at 10:16

MaxU - stand with Ukraine

205,989
36
386
419

Why do you have `regex = True` enabled? – Moondra May 08 '17 at 16:34
If I have `regex = True`, I'm running into this error; `'PandasExprVisitor' object has no attribute 'visit_Ellipsis'` . If I turn it off I'm running into an different error, which I posted in the OP as an image. Any ideas about the errors? – Moondra May 08 '17 at 17:14
@moondra, what is your Pandas version? – MaxU - stand with Ukraine May 08 '17 at 18:03
@moondra, we need `regex=True`, because otherwise the "multi-replace" (when we specify a list of replacements) will not work - it will not replace anything – MaxU - stand with Ukraine May 08 '17 at 18:06
my Pandas version is '0.19.2' – Moondra May 08 '17 at 18:08
@moondra, that's weird. Do you have any NaN's/None's or empty strings in the `Marketcap` column? – MaxU - stand with Ukraine May 08 '17 at 18:11
1

I ran this line and only found these two; the `k` I took care of incorporating 'k' in the code. I will update the OP so you can see everything clearly. – Moondra May 08 '17 at 18:18
@moondra, aha, got it. What do we do with `k`? Do you want to remove it or to ignore those rows or something else? – MaxU - stand with Ukraine May 08 '17 at 18:25
I'm not sure if you saw my next line,(under the image) but I tried dividing 'k' by a 1000. – Moondra May 08 '17 at 18:26
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/143714/discussion-between-maxu-and-moondra). – MaxU - stand with Ukraine May 08 '17 at 18:29
1

Okay, I submitted it as a bug. Hopefully they will take a look at it. Thank you for your help =) – Moondra May 08 '17 at 19:26
@moondra, good job! I've added a [comment there](https://github.com/pandas-dev/pandas/issues/16289) for reproducing this issue... – MaxU - stand with Ukraine May 08 '17 at 19:45
Thanks for that comment, that should help. – Moondra May 08 '17 at 19:49
( I don't want to take away, your points, but going to reopen the thread to look for other viable answers) – Moondra May 09 '17 at 15:00

MaxU - stand with Ukraine · Accepted Answer · 2017-05-09T19:30:38.720

Source DF:

In [176]: df
Out[176]:
                    0         1         2         3 Market Cap
0  1.707280  0.666952  0.638515 -0.061126  2.291747      1.71B
1 -1.017134  1.353627  0.618433  0.008279  0.148128      1.82B
2 -0.774057 -0.165566 -0.083345  0.741598 -0.139851       1.1M
3 -0.630724  0.250737  1.308556 -1.040799  1.064456     30.92M
4  2.029370  0.899612  0.261146  1.474148 -1.663970    476.74k
5  2.029370  0.899612  0.261146  1.474148 -1.663970         -1

Solution:

to_replace = ['\d+\s*[Kk]','\d+\s*[Mm]','\d+\s*[Bb]', '-1', 'N/A']
value = [1000,1000000,1000000000, 1, 1]

mask = df.assign(
    f=df['Market Cap'].replace(to_replace, value, regex=True),
    Marketcap=pd.to_numeric(df['Market Cap'].str.replace(r'[^\d\.]', ''), errors='coerce')
).eval("Marketcap * f < 35000000")

df[mask]

Result:

In [178]: df[mask]
Out[178]:
                    0         1         2         3 Market Cap
2 -0.774057 -0.165566 -0.083345  0.741598 -0.139851       1.1M
3 -0.630724  0.250737  1.308556 -1.040799  1.064456     30.92M
4  2.029370  0.899612  0.261146  1.474148 -1.663970    476.74k
5  2.029370  0.899612  0.261146  1.474148 -1.663970         -1

PS if you want to leave non-numeric values (like N/A) in the resulting data set change:

pd.to_numeric(df['Market Cap'].str.replace(r'[^\d\.]', ''), errors='coerce')

to

pd.to_numeric(df['Market Cap'].str.replace(r'[^\d\.]', ''), errors='coerce').fillna('0')

Thanks! I will take a look at it later today as it seems a little complicated and will need some time. BTW, to get these clean looking output cells (out[178] etc), do you do everything in Ipython via command line and just copy the cells? I tried copying Jupyter notebook output cells but when I paste it here, it's very untidy. — Moondra, May 09 '17 at 19:32
@moondra, yes, sorry, i do prefer iPython as I'm kind of console man ;-) — MaxU - stand with Ukraine, May 09 '17 at 19:33
Hi Max, I have a question about the `mask` portion of the code; The first `f` within `df.assign` creates a new column right? And the second part, `Marketcap=pd.to_numeric` is also creating a new column? I'm having a little trouble understanding that portion. Thanks! — Moondra, May 10 '17 at 17:20
@moondra, yes, `f` - is a new column (factor: 1, 1000, 1000000, etc.) and `Marketcap` is a cleaned numeric representation of `Market Cap` — MaxU - stand with Ukraine, May 10 '17 at 17:22

Custom boolean filtering in Pandas?

3 Answers3

Linked