3

I have a pandas dataframe and I want to create categories in a new column based on the values of another column. I can solve my basic problem by doing this:

range = {
    range(0, 5) : 'Below 5',
    range(6,10): 'between',
    range(11, 1000) : 'above'
}

df['range'] = df['value'].map(range)

In the final dictionary key I have chosen a large upper value for range to ensure it captures all the values I am trying to map. However, this seems an ugly hack and am wondering how to generalise this without specifying the upper limit. ie. if > 10 : 'above'.

Thanks

ojp
  • 973
  • 1
  • 11
  • 26
  • Think about using `apply` – Michael Heidelberg Mar 03 '20 at 15:40
  • Aye! I will give that a go! – ojp Mar 03 '20 at 15:43
  • 3
    Does this answer your question? [Mapping ranges of values in pandas dataframe](https://stackoverflow.com/questions/50098025/mapping-ranges-of-values-in-pandas-dataframe) Also, see [How to map numeric data into categories / bins in Pandas dataframe](https://stackoverflow.com/q/49382207/7851470) – Georgy Mar 04 '20 at 13:08

6 Answers6

5

You could set all values first to 'above', and then use map() for the remaining options (thus with your range dict having only two items):

range = {
    range(0, 5) : 'Below 5',
    range(6,10): 'between',

}
df['range'] = 'above'
df['range'] = df['value'].map(range)
9769953
  • 10,344
  • 3
  • 26
  • 37
5

Assume you have a dataframe like this:

  range value
0   0     0
1   1     1
2   2     2
3   3     3
4   4     4
5   5     5
6   6     6
7   7     7
8   8     8
9   9     9

Then you can apply the following function to the column 'value':

def get_value(range):
    if range < 5:
        return 'Below 5'
    elif range < 10:
        return 'Between 5 and 10'
    else:
        return 'Above 10'

df['value'] = df.apply(lambda col: get_value(col['range']), axis=1)

To get the result you want.

Michael Heidelberg
  • 993
  • 11
  • 24
4

Thanks for the hints. I see I can achieve the same with:

df['range'] = df['value'].map(range).fillna('above')
ojp
  • 973
  • 1
  • 11
  • 26
3

Here's another approach using numpy.select, where you specify a list of boolean conditions, and a list of choices:

import numpy as np 

# Setup
df = pd.DataFrame({'value': [1, 3, 6, 8, 20, 10000000]})

condlist = [
    df['value'].lt(5),
    df['value'].between(5, 10),
    df['value'].gt(10)]

choicelist = ['Below 5', 'between', 'above']

df['out'] = np.select(condlist, choicelist)
print(df)

[out]

      value      out
0         1  Below 5
1         3  Below 5
2         6  between
3         8  between
4        20    above
5  10000000    above

Another idea would be to use pandas.cut with bins and labels parameters specified:

df['out'] = pd.cut(df['value'], bins=[-np.inf, 5, 10, np.inf],
                   labels=['below', 'between', 'above'])

      value      out
0         1    below
1         3    below
2         6  between
3         8  between
4        20    above
5  10000000    above
Chris Adams
  • 18,389
  • 4
  • 22
  • 39
3

pandas.Series.map accepts also function as first argument, so you could do:

def fun(x):
    if x in range(0, 5):
        return 'Below 5'
    elif x in range(6, 10):
        return 'between'
    elif x >= 11:
        return 'above'

then:

df['range'] = df['value'].map(fun)
Daweo
  • 31,313
  • 3
  • 12
  • 25
  • Oh, nice. Now for replacing `run` with a lambda function with a nested ternary comparison, to one-line everything. I guess `df['range'] = df['value'].map(lambda x: 'Below 5' if in x range(0, 5) else ('between' if x in (range(6, 10) else 'above'))` could do that. – 9769953 Mar 03 '20 at 16:01
  • @0 0: for me one-line nested `if`s are not very readable. If I would have fit into 1 line in this case AT ANY PRICE then I would rather do: `f = lambda x:('Below 5','between','above')[(x>=5)+(x>=11)]` – Daweo Mar 04 '20 at 08:23
  • Also nice.I was definitely going for obfuscated Python one-liners. Luckily, in Python,`True + True == 2`. – 9769953 Mar 04 '20 at 09:45
0
 df['range'] = pd.cut(df['value'], bins = [0, 5, 10, 1000], labels = ["below 5", "between", "above"])
r.user.05apr
  • 5,356
  • 3
  • 22
  • 39
  • 1
    That still suffers from setting a rather arbritrary upper limit (one could use `max(df['value'])` of course. – 9769953 Mar 03 '20 at 15:48