Segmenting income into income groups by creating a new column -- Python

Question

I'm trying to create income buckets/groups based on already existing income groups. I want to create a new column for my dataframe to do that.

The issue is that the already existing income groups can't be matched because there are different ranges and currencies.

Initially I wanted to use regex to sort it, but I gave up (I don't know how to do it, or even whether it is possible).

I resorted to the following:

def Income_Groups(AnnualIncome): 

  Income = {
      'Under £5,000':'<25k','less than £25,000':'<25k','less than €25,000':'<25k','Between_0_5':'<25k','Between_0_25':'<25k','Between_5_15':'<25k','Between_15_30':'<25k', 
      '£25,001-£50,000':'25-50k','£30,000-£50,000':'25-50k','€25,001-€50,000':'25-50k','Between_25_50':'25-50k','Between_30_50':'25-50k', 
      '£50,001-£100,000':'50-100k','€50,001-€100,000':'50-100k','Between_50_75':'50-100k','Between_75_100':'50-100k','Between_50_100':'50-100k', 
      '£100,000+':'>100k','€100,000+':'>100k','Above_100':'>100k' 
  }
  
  try:
      return Income[AnnualIncome]
  except:
      return AnnualIncome

data_m['IncomeGroups'] = data_m.AnnualIncome.apply(Income_Groups)

This code worked, but it doesn't let me choose what to do with missing data: it automatically replaces the missing cells with '0', which is not what I want. I would rather see 'Na' or have the cells left empty.

I then tried another version (easier to read):

def Income_Groups(AnnualIncome): 
    if AnnualIncome in 'Under £5,000'|'less than £25,000'|'less than €25,000'|'Between_0_5'|'Between_0_25'|'Between_5_15'|'Between_15_30': return '<25k' 
    elif AnnualIncome in '£25,001-£50,000'|'£30,000-£50,000'|'€25,001-€50,000'|'Between_25_50'|'Between_30_50': return '25-50k'
    elif AnnualIncome in '£50,001-£100,000'|'€50,001-€100,000'|'Between_50_75'|'Between_75_100'|'Between_50_100': return '50-100k' 
    elif AnnualIncome in '£100,000+'|'€100,000+'|'Above_100': return '>100k' 
    else: return ''

data_m['IncomeGroups'] = data_m.AnnualIncome.apply(Income_Groups)

(I haven't tried one 'if/elif' and 'return' per condition, since there are a bit too many.)

However, this second version raises the following error:

TypeError Traceback (most recent call last) in
      8     else: return ''
      9
---> 10 data_m['IncomeGroups'] = data_m.AnnualIncome.apply(Income_Groups)

~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   3846             else:
   3847                 values = self.astype(object).values
-> 3848                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   3849
   3850         if len(mapped) and isinstance(mapped[0], Series):

pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()

in Income_Groups(AnnualIncome)
      2
      3 def Income_Groups(AnnualIncome):
----> 4     if AnnualIncome in 'Under £5,000'|'less than £25,000'|'less than €25,000'|'Between_0_5'|'Between_0_25'|'Between_5_15'|'Between_15_30': return '<25k'
      5     elif AnnualIncome in '£25,001-£50,000'|'£30,000-£50,000'|'€25,001-€50,000'|'Between_25_50'|'Between_30_50': return '25-50k'
      6     elif AnnualIncome in '£50,001-£100,000'|'€50,001-€100,000'|'Between_50_75'|'Between_75_100'|'Between_50_100': return '50-100k'

TypeError: unsupported operand type(s) for |: 'str' and 'str'

Would greatly appreciate your help!!

`|` is not the correct operator. Try `AnnualIncome in ('Under £5,000','less than £25,000','less than €25,000','Between_0_5','Between_0_25','Between_5_15','Between_15_30'): return '<25k' ` — wwii, Oct 12 '20 at 14:42
[https://docs.python.org/3/reference/expressions.html#membership-test-operations](https://docs.python.org/3/reference/expressions.html#membership-test-operations), `|` is the [binary bitwise or operator](https://docs.python.org/3/reference/expressions.html#binary-bitwise-operations) — wwii, Oct 12 '20 at 14:52

score 0 · Answer 1 · answered Oct 12 '20 at 14:43

You're trying to do a bitwise operation on a string.
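A minimal sketch of what goes wrong, assuming nothing beyond plain Python: `|` binds more tightly than `in`, so Python evaluates the `|` between the two strings first and fails before any membership test happens. The variable names `message`, `annual_income` and `is_low` are only for illustration.

```python
# Strings do not support the bitwise-or operator, so this raises TypeError
# before any `in` check could run.
try:
    'Between_0_5' | 'Between_0_25'
except TypeError as exc:
    message = str(exc)

print(message)

# A membership test against a tuple of strings is what the code intended.
annual_income = 'Between_0_5'
is_low = annual_income in ('Between_0_5', 'Between_0_25')
print(is_low)
```

The printed message should match the last line of the traceback in the question.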

Replace every `|` with `,` and wrap the values in square brackets to make a list of strings, as in this example:

AnnualIncome = "Under £5,000"

if AnnualIncome in ['Under £5,000','less than £25,000','less than €25,000','Between_0_5','Between_0_25','Between_5_15','Between_15_30']:
    print("ok")

Output:

ok
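Applied to the whole function, the same idea could look like the sketch below. It also addresses the missing-data part of the question by returning a configurable placeholder for anything that is not a known label. The names `INCOME_GROUPS`, `LABEL_TO_GROUP`, `income_group` and the `missing` parameter are hypothetical, and defaulting to NaN is an assumption about what "left empty" should mean; the labels themselves are copied from the question.

```python
import math

# Each target bucket lists the source labels taken from the question.
INCOME_GROUPS = {
    '<25k': ('Under £5,000', 'less than £25,000', 'less than €25,000',
             'Between_0_5', 'Between_0_25', 'Between_5_15', 'Between_15_30'),
    '25-50k': ('£25,001-£50,000', '£30,000-£50,000', '€25,001-€50,000',
               'Between_25_50', 'Between_30_50'),
    '50-100k': ('£50,001-£100,000', '€50,001-€100,000', 'Between_50_75',
                'Between_75_100', 'Between_50_100'),
    '>100k': ('£100,000+', '€100,000+', 'Above_100'),
}

# Invert the table once so that each lookup is a single dict access.
LABEL_TO_GROUP = {
    label: group
    for group, labels in INCOME_GROUPS.items()
    for label in labels
}


def income_group(annual_income, missing=math.nan):
    """Return the bucket for a known label, otherwise `missing`."""
    return LABEL_TO_GROUP.get(annual_income, missing)


print(income_group('Between_30_50'))
print(income_group(None, missing=''))
```

With the dataframe from the question this would be used as `data_m['IncomeGroups'] = data_m.AnnualIncome.apply(income_group)`. Passing the dict straight to `Series.map` is another option, since `map` leaves values that are not keys of the dict as NaN.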
