I am trying to do some calculations. This works fine:
import pandas
import numpy
print(pandas.__version__) #=> 0.25.2
columns = ['CUR', 'ISSUER_INDUSTRY', 'PROFIT']
data = [
{'CUR': 'NOK', 'ISSUER_INDUSTRY': 'BANK', 'PROFIT': 10},
{'CUR': 'NOK', 'ISSUER_INDUSTRY': 'BANK', 'PROFIT': 60},
{'CUR': 'NOK', 'ISSUER_INDUSTRY': 'BANK', 'PROFIT': 80},
{'CUR': 'SEK', 'ISSUER_INDUSTRY': 'BANK', 'PROFIT': 30},
{'CUR': 'SEK', 'ISSUER_INDUSTRY': 'TECH', 'PROFIT': 30},
]
df = pandas.DataFrame(data=data, columns=columns)
df['above_50_quantile_profit'] = df.PROFIT >= df.groupby(['CUR','ISSUER_INDUSTRY']).PROFIT.transform('quantile')
df['above_90_quantile_profit'] = df.PROFIT >= df.groupby(['CUR','ISSUER_INDUSTRY']).PROFIT.transform(lambda x: x.quantile(0.9))
The DF now looks like
NOK BANK 10 False False
NOK BANK 60 True False
NOK BANK 80 True True
SEK BANK 30 True True
SEK TECH 30 True True
But in reality data is never this clean. So I have some rows that are missing industry data:
df.loc[df.CUR=='SEK','ISSUER_INDUSTRY'] = numpy.nan
Now my first calculation works fine:
df['above_50_quantile_profit'] = df.PROFIT >= df.groupby(['CUR','ISSUER_INDUSTRY']).PROFIT.transform('quantile')
but my second:
df['above_90_quantile_profit'] = df.PROFIT >= df.groupby(['CUR','ISSUER_INDUSTRY']).PROFIT.transform(lambda x: x.quantile(0.9))
now raised:
ValueError: Length mismatch: Expected axis has 3 elements, new values have 5 elements
I understand that grouping on NaN is a well know problem (see here), what I do not understand is why the first version .transform('quantile')
have the same problem, as it is also a groupby call.
Could anyone help me understand the difference?