Assume I have the following pd.DataFrame:

import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30],
                   'b': [5, 25, 30]})

and I wish to get

    a   b    label
0   10  5    1
1   20  25   2
2   30  30   3

meaning:

  • if a > b then label=1
  • if a < b then label=2
  • if a = b then label=3

I'm not sure how to do so when I have multiple conditions.

Buzi

3 Answers

Having some fun with np.sign, which naturally maps the sign of a - b to a category:

import numpy as np

df['label'] = np.sign(df['a'] - df['b']).map({1: 1, -1: 2, 0: 3})
df

    a   b  label
0  10   5      1
1  20  25      2
2  30  30      3

The neat part is that np.sign returns a Series when it is given a Series, so I can call Series.map on the result directly to get the labels you want.

cs95
  • @Sushanth Thank you, and you as well. If I'd answered first, my post would have had what yours has, since `np.where` is the de facto solution for this (and about 10-20% faster than this according to tests). But since I'm late to the party, I thought I'd share something you don't usually see :-) – cs95 Jul 04 '20 at 10:26
  • @cs95 Nice answer, just a small query: the documentation of `np.sign` states that it returns an `ndarray`, so how is it possible to use Series methods on the result? – Shubham Sharma Jul 04 '20 at 10:29
  • @ShubhamSharma Great question. In a nutshell, numpy knows that it is the foundation for a lot of libraries and so tries its best to work with them. Think of a pandas Series as a fancy wrapping object over a numpy array (which it is). Numpy will try to preserve the identity of the "wrapping object", so to speak, while transforming the underlying data. See [related q](https://stackoverflow.com/questions/47893677/why-are-numpy-functions-so-slow-on-pandas-series-dataframes). – cs95 Jul 04 '20 at 10:33
  • @cs95 Thanks for the explanation, I got the point. Already bookmarked and upvoted the question you suggested :). – Shubham Sharma Jul 04 '20 at 10:45
  • @cs95 I tried using `np.diff` on a pandas Series, but in that case it returns a plain `ndarray`, whereas `np.sign()` returns a pandas Series, so the Series methods are not available on the result. Can you explain this behaviour? – Shubham Sharma Jul 04 '20 at 10:54
  • @ShubhamSharma I think the difference is `np.diff` returns an array that's 1-element smaller than the input, so perhaps it makes no assumptions and just returns an array. – cs95 Jul 04 '20 at 10:56
  • @cs95 So I guess the result of the NumPy operation should have the same size as the original input Series for the wrapping behaviour to work, right? By the way, thanks a lot for explaining. – Shubham Sharma Jul 04 '20 at 11:03
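
The behaviour discussed in these comments can be checked directly. The sketch below is not from the thread; it uses a made-up Series `s` and only inspects the return types. As far as I can tell, one relevant difference is that `np.sign` is a NumPy ufunc, which pandas objects hook into, whereas `np.diff` is an ordinary function that converts its input to a plain array first, so output size may not be the whole story.

```python
import numpy as np
import pandas as pd

# Hypothetical example data, with a custom index so we can see whether it survives.
s = pd.Series([5, -5, 0], index=["x", "y", "z"])

signed = np.sign(s)  # np.sign is a ufunc
diffed = np.diff(s)  # np.diff is a regular function, not a ufunc

print(type(signed).__name__)  # Series
print(type(diffed).__name__)  # ndarray
print(isinstance(np.sign, np.ufunc), isinstance(np.diff, np.ufunc))  # True False
```

If a Series is needed after a differencing step, `s.diff()` is the pandas method that keeps the index (with a leading NaN).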

Try this, using np.where and .loc:

df['label'] = np.where(df['a'] > df['b'], 1, 2)

df.loc[df['a'] == df['b'], 'label'] = 3

Edit: alternatively, as a single expression with nested np.where:

df['label'] = np.where(df['a'] > df['b'], 1, np.where(df['a'] < df['b'], 2, 3))
sushanth
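
One thing worth checking with the two `np.where` variants above is how they treat missing values. The following is a minimal sketch, not part of the original answer, that adds a hypothetical fourth row with NaN in `a`: every comparison against NaN is False, so the two-step version leaves that row at 2, while the nested version falls through to 3.

```python
import numpy as np
import pandas as pd

# Same data as the question, plus one extra row (an assumption) with a missing value.
df = pd.DataFrame({"a": [10, 20, 30, np.nan], "b": [5, 25, 30, 1]})

# Variant 1: np.where for ">" versus "everything else", then overwrite ties with 3.
two_step = pd.Series(np.where(df["a"] > df["b"], 1, 2), index=df.index)
two_step[df["a"] == df["b"]] = 3

# Variant 2: nested np.where, where 3 is the fall-through value.
nested = np.where(df["a"] > df["b"], 1, np.where(df["a"] < df["b"], 2, 3))

print(two_step.tolist())  # [1, 2, 3, 2]
print(nested.tolist())    # [1, 2, 3, 3]
```

On data without NaN, such as the frame in the question, both variants give the same labels.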

Use np.select to choose among values based on a list of conditions:

df['label'] = np.select([df['a'].gt(df['b']), df['a'].lt(df['b'])], [1, 2], 3)

Result:

# print(df)

    a   b   label
0   10  5       1
1   20  25      2
2   30  30      3
Shubham Sharma
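
For readers who want to extend the `np.select` approach to more conditions, here is a self-contained sketch. The variable names `conditions` and `choices` are just illustrative; the point is that each condition is paired positionally with a choice, the first matching condition wins, and `default` covers rows where none match.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30], "b": [5, 25, 30]})

# Conditions are evaluated in order; choices[i] is used where conditions[i] is True.
conditions = [
    df["a"] > df["b"],
    df["a"] < df["b"],
]
choices = [1, 2]

# default=3 labels every row where no condition holds (here: a == b).
df["label"] = np.select(conditions, choices, default=3)

print(df["label"].tolist())  # [1, 2, 3]
```

Adding a further rule is then a matter of appending one entry to each list, keeping the two lists the same length.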