0

I am trying to do some calculations. This works fine:

import pandas
import numpy

print(pandas.__version__) #=> 0.25.2

columns = ['CUR', 'ISSUER_INDUSTRY', 'PROFIT']
data = [
    {'CUR': 'NOK', 'ISSUER_INDUSTRY': 'BANK', 'PROFIT': 10},
    {'CUR': 'NOK',  'ISSUER_INDUSTRY': 'BANK', 'PROFIT': 60},
    {'CUR': 'NOK', 'ISSUER_INDUSTRY': 'BANK', 'PROFIT': 80},
    {'CUR': 'SEK', 'ISSUER_INDUSTRY': 'BANK', 'PROFIT': 30},
    {'CUR': 'SEK', 'ISSUER_INDUSTRY': 'TECH', 'PROFIT': 30},
]

df = pandas.DataFrame(data=data, columns=columns)

df['above_50_quantile_profit'] = df.PROFIT >= df.groupby(['CUR','ISSUER_INDUSTRY']).PROFIT.transform('quantile')

df['above_90_quantile_profit'] = df.PROFIT >= df.groupby(['CUR','ISSUER_INDUSTRY']).PROFIT.transform(lambda x: x.quantile(0.9))

The DF now looks like

NOK     BANK    10  False   False
NOK     BANK    60  True    False
NOK     BANK    80  True    True
SEK     BANK    30  True    True
SEK     TECH    30  True    True

But in reality data is never this clean. So I have some rows that are missing industry data:

df.loc[df.CUR=='SEK','ISSUER_INDUSTRY'] = numpy.nan

Now my first calculation works fine:

df['above_50_quantile_profit'] = df.PROFIT >= df.groupby(['CUR','ISSUER_INDUSTRY']).PROFIT.transform('quantile')

but my second:

df['above_90_quantile_profit'] = df.PROFIT >= df.groupby(['CUR','ISSUER_INDUSTRY']).PROFIT.transform(lambda x: x.quantile(0.9))

now raised:

ValueError: Length mismatch: Expected axis has 3 elements, new values have 5 elements

I understand that grouping on NaN is a well know problem (see here), what I do not understand is why the first version .transform('quantile') have the same problem, as it is also a groupby call.

Could anyone help me understand the difference?

hirolau
  • 13,451
  • 8
  • 35
  • 47
  • 1
    There is problem if NaNs then rows are removed by design, so added link for solution. – jezrael Oct 30 '19 at 10:15
  • A simple way to deal with this problem would be a conversion to `Int64`. So you could implement this: `df['above_90_quantile_profit'] = df.PROFIT >= df.groupby(['CUR','ISSUER_INDUSTRY']).PROFIT.transform(lambda x: x.astype("Int64").quantile(0.9))` – Fourier Oct 30 '19 at 10:18

0 Answers0