I'm trying to create income buckets/groups based on already existing income groups. I want to create a new column for my dataframe to do that.
The issue is that the existing income groups can't be matched directly, because they use different ranges and currencies.
Initially I wanted to use regex to sort them, but I gave up (I don't know how to do it, or even whether it is possible).
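For what it's worth, a regex version might look roughly like the sketch below. It rests on assumptions that are not from my data dictionary: the function name `income_group_from_label` is made up, numbers of 1000 or more are treated as full amounts while smaller ones (the `Between_x_y` style) are treated as thousands, labels starting with "Under" or "less than" get a lower bound of 0, and the bucket is chosen from the lower bound only. Currency symbols are ignored.

```python
import re

def income_group_from_label(label):
    """Bucket an income label by its lower bound, expressed in thousands."""
    if not isinstance(label, str):
        return None  # missing values (e.g. NaN) stay missing
    # Pull out every number, dropping thousands separators: '25,001' -> 25001
    numbers = [int(n.replace(',', '')) for n in re.findall(r'\d[\d,]*', label)]
    if not numbers:
        return None  # label without any number: leave it unclassified
    if re.match(r'(under|less than)\b', label, flags=re.IGNORECASE):
        lower = 0  # assumption: "Under X" / "less than X" starts at zero
    else:
        lower = numbers[0]
    if lower >= 1000:
        lower = lower / 1000  # assumption: large numbers are full amounts
    if lower < 25:
        return '<25k'
    if lower < 50:
        return '25-50k'
    if lower < 100:
        return '50-100k'
    return '>100k'
```

With these rules `'Between_15_30'` falls into `'<25k'` and `'£30,000-£50,000'` into `'25-50k'`, the same as in my hand-written mapping, but any new label style would need checking by hand.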
Instead, I resorted to the following:
def Income_Groups(AnnualIncome):
    Income = {
        'Under £5,000': '<25k', 'less than £25,000': '<25k', 'less than €25,000': '<25k', 'Between_0_5': '<25k', 'Between_0_25': '<25k', 'Between_5_15': '<25k', 'Between_15_30': '<25k',
        '£25,001-£50,000': '25-50k', '£30,000-£50,000': '25-50k', '€25,001-€50,000': '25-50k', 'Between_25_50': '25-50k', 'Between_30_50': '25-50k',
        '£50,001-£100,000': '50-100k', '€50,001-€100,000': '50-100k', 'Between_50_75': '50-100k', 'Between_75_100': '50-100k', 'Between_50_100': '50-100k',
        '£100,000+': '>100k', '€100,000+': '>100k', 'Above_100': '>100k'
    }
    try:
        return Income[AnnualIncome]
    except:
        return AnnualIncome

data_m['IncomeGroups'] = data_m.AnnualIncome.apply(Income_Groups)
This code works, but it doesn't let me choose what happens to missing data: the missing cells automatically end up as '0', which is not what I want. I would rather see 'Na' or have the cells left empty.
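Note that the `except` branch in that function returns the input unchanged, so any value that is not a key of the dictionary passes straight through into the new column. One possible way to make the missing-data result explicit is `dict.get` with a default. This is only a sketch: `INCOME_MAP`, `income_group` and the `missing` parameter are names invented here, and the pandas call is shown as a comment because it depends on my `data_m` dataframe.

```python
# Lookup table from the existing labels to the new buckets.
INCOME_MAP = {
    'Under £5,000': '<25k', 'less than £25,000': '<25k',
    'less than €25,000': '<25k', 'Between_0_5': '<25k',
    'Between_0_25': '<25k', 'Between_5_15': '<25k', 'Between_15_30': '<25k',
    '£25,001-£50,000': '25-50k', '£30,000-£50,000': '25-50k',
    '€25,001-€50,000': '25-50k', 'Between_25_50': '25-50k',
    'Between_30_50': '25-50k',
    '£50,001-£100,000': '50-100k', '€50,001-€100,000': '50-100k',
    'Between_50_75': '50-100k', 'Between_75_100': '50-100k',
    'Between_50_100': '50-100k',
    '£100,000+': '>100k', '€100,000+': '>100k', 'Above_100': '>100k',
}

def income_group(annual_income, missing=None):
    """Return the bucket for a label, or `missing` for anything unrecognised."""
    # dict.get never raises KeyError; NaN, 0 or unknown labels yield `missing`.
    return INCOME_MAP.get(annual_income, missing)

# Hypothetical usage on the dataframe from the question
# (Series.apply forwards extra keyword arguments to the function):
# data_m['IncomeGroups'] = data_m['AnnualIncome'].apply(income_group, missing='Na')
#
# Alternatively, Series.map with a dict leaves unmatched values as NaN:
# data_m['IncomeGroups'] = data_m['AnnualIncome'].map(INCOME_MAP)
```

Passing `missing=''` instead of `'Na'` would leave the unmatched cells as empty strings.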
I then tried another version (easier to read):
def Income_Groups(AnnualIncome):
    if AnnualIncome in 'Under £5,000'|'less than £25,000'|'less than €25,000'|'Between_0_5'|'Between_0_25'|'Between_5_15'|'Between_15_30': return '<25k'
    elif AnnualIncome in '£25,001-£50,000'|'£30,000-£50,000'|'€25,001-€50,000'|'Between_25_50'|'Between_30_50': return '25-50k'
    elif AnnualIncome in '£50,001-£100,000'|'€50,001-€100,000'|'Between_50_75'|'Between_75_100'|'Between_50_100': return '50-100k'
    elif AnnualIncome in '£100,000+'|'€100,000+'|'Above_100': return '>100k'
    else: return ''

data_m['IncomeGroups'] = data_m.AnnualIncome.apply(Income_Groups)
(I haven't tried writing one 'if/elif' and 'return' per condition, since there are a bit too many.)
However, this second version raises the following error:
TypeError                                 Traceback (most recent call last)
 in
      8     else: return ''
      9
---> 10 data_m['IncomeGroups'] = data_m.AnnualIncome.apply(Income_Groups)

~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   3846             else:
   3847                 values = self.astype(object).values
-> 3848                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   3849
   3850             if len(mapped) and isinstance(mapped[0], Series):

pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()

 in Income_Groups(AnnualIncome)
      2
      3 def Income_Groups(AnnualIncome):
----> 4     if AnnualIncome in 'Under £5,000'|'less than £25,000'|'less than €25,000'|'Between_0_5'|'Between_0_25'|'Between_5_15'|'Between_15_30': return '<25k'
      5     elif AnnualIncome in '£25,001-£50,000'|'£30,000-£50,000'|'€25,001-€50,000'|'Between_25_50'|'Between_30_50': return '25-50k'
      6     elif AnnualIncome in '£50,001-£100,000'|'€50,001-€100,000'|'Between_50_75'|'Between_75_100'|'Between_50_100': return '50-100k'

TypeError: unsupported operand type(s) for |: 'str' and 'str'
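The error itself comes from `|`: Python does not define that operator between two strings, so `'a'|'b'` fails before `in` is ever evaluated. A membership test against a set (or tuple) of labels expresses the same intent. The sketch below keeps the function name from the second attempt; the set names `UNDER_25K`, `FROM_25K_TO_50K`, `FROM_50K_TO_100K` and `OVER_100K` are made up for illustration.

```python
# One set of source labels per target bucket.
UNDER_25K = {'Under £5,000', 'less than £25,000', 'less than €25,000',
             'Between_0_5', 'Between_0_25', 'Between_5_15', 'Between_15_30'}
FROM_25K_TO_50K = {'£25,001-£50,000', '£30,000-£50,000', '€25,001-€50,000',
                   'Between_25_50', 'Between_30_50'}
FROM_50K_TO_100K = {'£50,001-£100,000', '€50,001-€100,000',
                    'Between_50_75', 'Between_75_100', 'Between_50_100'}
OVER_100K = {'£100,000+', '€100,000+', 'Above_100'}

def Income_Groups(AnnualIncome):
    # `x in some_set` checks whether x equals any element of the set.
    if AnnualIncome in UNDER_25K:
        return '<25k'
    elif AnnualIncome in FROM_25K_TO_50K:
        return '25-50k'
    elif AnnualIncome in FROM_50K_TO_100K:
        return '50-100k'
    elif AnnualIncome in OVER_100K:
        return '>100k'
    else:
        return ''  # anything unrecognised, including missing values

# Hypothetical usage, unchanged from the question:
# data_m['IncomeGroups'] = data_m.AnnualIncome.apply(Income_Groups)
```

Changing the final `return ''` to `return 'Na'` would label the unmatched cells instead of leaving them empty.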
Would greatly appreciate your help!!