How to bin numbers quickly in Python with multiple conditions

Question

I want to bin values into ranges that I define myself.

A lambda is easy, but what if there are more than two conditions? I tried a for loop with if/elif, but it does not change anything.

import pandas as pd

country = pd.DataFrame({'COUNTRY':['China','JAPAN','KOREA', 'USA', 'UK'],
               'POPULATION':[1200,2345,3400,5600,9600],
               'ECONOMY':[86212,11862,1000, 8555,12000]})

for x in country.POPULATION:
    if x < 2000:
        x = 'small'
    elif x > 2000 and x <= 4000:
        x = 'medium'
    elif x > 5000 and x <= 6000:
        x = 'big'
    else:
        'huge'

I expect each row to get 'small', 'medium', etc. according to the range its value falls in.

score 1 · Accepted Answer · answered Feb 02 '19 at 01:30

I would use np.select with multiple conditions:

import numpy as np

conditions = [
    country['POPULATION'] < 2000,
    ((country['POPULATION'] > 2000) & (country['POPULATION'] <= 4000)),
    ((country['POPULATION'] > 5000) & (country['POPULATION'] <=6000))
]

choices = [
    'small',
    'medium',
    'big'
]

# create a new column or assign to an existing one
# the last argument to np.select is the default, used when no condition matches
country['new'] = np.select(conditions, choices, 'huge')

  COUNTRY  POPULATION  ECONOMY     new
0   China        1200    86212   small
1   JAPAN        2345    11862  medium
2   KOREA        3400     1000  medium
3     USA        5600     8555     big
4      UK        9600    12000    huge
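
As for why the loop in the question has no effect: `x = 'small'` only rebinds the loop variable and never writes anything back to the DataFrame, and the bare `'huge'` in the else branch is an expression whose value is discarded. If a row-by-row version is still preferred, a minimal sketch is to return the label from a function and assign the mapped result to a column. The helper name `size_label` is made up for illustration.

```python
import pandas as pd

country = pd.DataFrame({'COUNTRY': ['China', 'JAPAN', 'KOREA', 'USA', 'UK'],
                        'POPULATION': [1200, 2345, 3400, 5600, 9600],
                        'ECONOMY': [86212, 11862, 1000, 8555, 12000]})


def size_label(x):
    """Hypothetical helper: same rules as the if/elif chain in the question."""
    if x < 2000:
        return 'small'
    elif 2000 < x <= 4000:
        return 'medium'
    elif 5000 < x <= 6000:
        return 'big'
    # Everything else, including exactly 2000 and the 4000-5000 gap.
    return 'huge'


# map() calls the function once per value; the result must be assigned
# back to a column, otherwise the DataFrame is left unchanged.
country['new'] = country['POPULATION'].map(size_label)
print(country)
```

This gives the same labels as the np.select version for the sample data. On large frames np.select will usually be faster, since it works on whole columns instead of calling a Python function per row.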

score 0 · Answer 2 · answered Feb 02 '19 at 01:35

np.select from @Chris looks good, but I wrote out an answer for pd.cut (see docs) so I might as well post it:

import pandas as pd
df = pd.DataFrame({'COUNTRY':['China','JAPAN','KOREA', 'USA', 'UK'],
               'POPULATION':[1200,2345,3400,5600,9600],
               'ECONOMY':[86212,11862,1000, 8555,12000]})

df["size"] = pd.cut(df["POPULATION"],
                bins=[0, 2000, 4000, 5000, 6000, df.POPULATION.max()],
                labels=["Small", "Medium", "NaN", "Large", "Huge"])

It's a bit funkier because you have to handle the gap between 4,000 and 5,000 by giving it an arbitrary label (here I wrote "NaN", but that's wrong: it is just a string, not a real missing value).
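
One way to avoid the placeholder label, sketched here under the assumption of pandas 1.1 or later, is to reuse the fallback label for the gap. `pd.cut` accepts duplicate labels only when `ordered=False` is passed. Using `np.inf` as the last edge is also an assumption of this sketch; it keeps the bin edges increasing even when the largest value is below 6000. The lowercase labels follow the question instead of the capitalized ones above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'COUNTRY': ['China', 'JAPAN', 'KOREA', 'USA', 'UK'],
                   'POPULATION': [1200, 2345, 3400, 5600, 9600],
                   'ECONOMY': [86212, 11862, 1000, 8555, 12000]})

# Bins are right-inclusive by default: (0, 2000], (2000, 4000], ...
bins = [0, 2000, 4000, 5000, 6000, np.inf]

# 'huge' is used twice: for the 4000-5000 gap and for everything above 6000.
# Duplicate labels require ordered=False.
labels = ['small', 'medium', 'huge', 'big', 'huge']

df['size'] = pd.cut(df['POPULATION'], bins=bins, labels=labels, ordered=False)
print(df)
```

Compared with the if/elif rules in the question, this differs only at the edges: exactly 2000 lands in 'small' here, and values at or below 0 fall outside the bins and become NaN.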

Charles Landau


Hi Charles, I left that gap on purpose because the ranges are irregular. – Feb 02 '19 at 01:39
I wasn't saying it was wrong to drop out the 4000-5000 range, just that it means you need a label for that range in order to make `pd.cut` work – Charles Landau Feb 02 '19 at 01:39
You are right. But it seems to create unnecessary NaNs... – Feb 02 '19 at 01:49

The docs say: `Any NA values will be NA in the result. Out of bounds values will be NA in the resulting Series or pandas.Categorical object` – Charles Landau Feb 02 '19 at 01:51
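
The behavior quoted from the docs can be used deliberately. In this sketch, which is an illustration and not part of the original answer, the bins are given as an `IntervalIndex` that skips the 4000-5000 range, so values in the gap or above 6000 come back as NaN and can then be filled with the fallback label. The extra value 4500 is added to the sample data only to exercise the gap.

```python
import pandas as pd

# Sample populations from the question, plus 4500 to hit the gap.
population = pd.Series([1200, 2345, 3400, 4500, 5600, 9600])

# Explicit, non-overlapping bins (right-inclusive by default); no bin
# covers 4000-5000 or anything above 6000.
bins = pd.IntervalIndex.from_tuples([(0, 2000), (2000, 4000), (5000, 6000)])

# Values outside every interval become NaN.
binned = pd.cut(population, bins=bins)
missing = binned.isna().tolist()

# Replace the interval categories with names, then fill the NaNs.
labeled = binned.cat.rename_categories(['small', 'medium', 'big'])
result = labeled.cat.add_categories(['huge']).fillna('huge')
print(result)
```

The NaNs here are real missing values, so `isna()` and `fillna()` work on them, unlike the "NaN" string label.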
