
The table below counts unique words in a text (the German text of Hamlet, in this case).

Using pandas, I would like to add a 'frequency' column that contains one of three labels:

  • If the value in the 'count' column is <= 10, the frequency is 'infrequent'.

  • If the value in the 'count' column is > 10, the frequency is 'frequent'.

  • If the value in the 'count' column is 1, the frequency is 'unique'.

I am new to pandas, so I initially thought I would have to use a for loop with if/else. Of course, that didn't work for me, and after reading around I see you can just use .loc[] instead. It's much cleaner.

I'll put the answer below in case anyone else needs this set out really clearly. Here's the table I'm working with before:

      count                 word  length
0     67223                            0
1         7               deinen       6
2         1          überwachsen      11
3         3                 them       4
4         2            fortunens       9
5         1              flammen       7
6         1    ersäuentsezlichen      17
7         2              alleino       7
8         1             empfehle       8
9         1  beschulöffentlicher      19
10        1         unterthänige      12
11        1                   pr       2
12        1       zurükzutreiben      14
13       38                   wo       2
14        1          schadhaften      11
15        1               ddiese       6
16        1         zurükhaltend      12
17        1                 laim       4
18        1               agents       6
avwinter

4 Answers


This is a fantastic use case for pd.cut:

import numpy as np
import pandas as pd

pd.cut(df['count'],
       bins=[-np.inf, 1, 10, np.inf],
       labels=['unique', 'infrequent', 'frequent'])

0       frequent
1     infrequent
2         unique
3     infrequent
4     infrequent
5         unique
6         unique
7     infrequent
8         unique
9         unique
10        unique
11        unique
12        unique
13      frequent
14        unique
15        unique
16        unique
17        unique
18        unique
Name: count, dtype: category
Categories (3, object): [unique < infrequent < frequent]

The disadvantage of np.select in the other answer is that it evaluates all conditions before selecting, so it will not scale as well as the number of conditions grows.
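To attach the pd.cut result as the column the question asks for, a small self-contained sketch (the counts here are made-up stand-ins for the question's data):

```python
import numpy as np
import pandas as pd

# Toy counts standing in for the question's data
df = pd.DataFrame({'count': [67223, 7, 1, 3]})

# Bins are right-closed by default: (-inf, 1], (1, 10], (10, inf]
df['frequency'] = pd.cut(df['count'],
                         bins=[-np.inf, 1, 10, np.inf],
                         labels=['unique', 'infrequent', 'frequent'])
print(df)
```

Because the labels are passed in increasing order of the bins, the resulting column is an ordered categorical, which also sorts sensibly.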

cs95

For multiple conditions, consider using np.select:

import numpy as np

conditions = [data['count'] == 1, data['count'] > 10, data['count'] <= 10]
choices = ['unique', 'frequent', 'infrequent']

data['frequency'] = np.select(conditions, choices)

The order of your conditions is important: np.select takes the first condition that matches, so you do not want data['count'] <= 10 to catch a count of 1 before the 'unique' condition is tested.

You may also wish to consider pd.cut or np.digitize, see also How to map numeric data into categories / bins in Pandas dataframe.
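The np.digitize route mentioned above can be sketched like this (a minimal example with made-up counts; the bin edges mirror the thresholds in the question):

```python
import numpy as np
import pandas as pd

# Hypothetical sample counts mirroring the question's data
data = pd.DataFrame({'count': [67223, 7, 1, 3, 38]})

# np.digitize returns the index of the bin each value falls into:
# with bins [2, 11], a 1 maps to 0, 2..10 map to 1, and > 10 maps to 2
labels = np.array(['unique', 'infrequent', 'frequent'])
data['frequency'] = labels[np.digitize(data['count'], bins=[2, 11])]

print(data)
```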

jpp
  • Excellent as always, can you guide me to a good web link for numpy please, not good there. thanks – anky Dec 29 '18 at 14:17
  • The `data['count'] == 1` should be before `data['count'] <= 10`, otherwise `count == 1` is also classified as "infrequent" (when it should be "unique"). The order of the conditions matter. – cs95 Dec 29 '18 at 14:20
  • I am not worried about that since I am not the OP. :) I have been a keen follower of both of you @coldspeed if you and jpp can share some good numpy links , that would be very helpful. – anky Dec 29 '18 at 14:30
  • @anky_91 [Here is a hopefully instructive starting point](https://stackoverflow.com/search?tab=votes&q=user%3a4909087%20%5bnumpy%5d%20is%3aa) :) As for how/why I learned, it was mostly out of a requirement to speed up existing pandas code (doing the same thing with numpy is usually faster than pandas because of less overhead). Docs, and other, older answers mainly. – cs95 Dec 29 '18 at 14:33
  • @coldspeed, Good point, thanks. We can just change the order of the conditions. – jpp Dec 29 '18 at 15:11

Another alternative is to use DataFrame.apply:

def frequer(wordcnt):
    if wordcnt == 1:
        return 'unique'
    elif wordcnt > 10:
        return 'frequent'
    else:
        return 'infrequent'

df['freq'] = df.apply(lambda x: frequer(x['count']), axis=1)

Output: the original frame with an added 'freq' column (screenshot not recoverable).
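Since frequer only needs the count value, the same result can be had without the row-wise axis=1 apply by applying over the Series directly. A small self-contained variation (the sample counts here are taken from the question's table):

```python
import pandas as pd

def frequer(wordcnt):
    if wordcnt == 1:
        return 'unique'
    elif wordcnt > 10:
        return 'frequent'
    else:
        return 'infrequent'

# Sample counts from the question's table
df = pd.DataFrame({'count': [67223, 7, 1, 3]})

# Series.apply calls frequer once per value; no row iteration needed
df['freq'] = df['count'].apply(frequer)
print(df)
```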

HakunaMaData

After (please ignore the first row, though, which just gives the overall number of unique words in the text):

data.loc[data["count"] > 10,  "frequency"] = "frequent"
data.loc[data["count"] <= 10, "frequency"] = "infrequent"
data.loc[data["count"] == 1,  "frequency"] = "unique"

result:

      count                 word  length   frequency
0     67223                            0    frequent
1         7               deinen       6  infrequent
2         1          überwachsen      11      unique
3         3                 them       4  infrequent
4         2            fortunens       9  infrequent
5         1              flammen       7      unique
6         1    ersäuentsezlichen      17      unique
7         2              alleino       7  infrequent
8         1             empfehle       8      unique
9         1  beschulöffentlicher      19      unique
10        1         unterthänige      12      unique
11        1                   pr       2      unique
12        1       zurükzutreiben      14      unique
13       38                   wo       2    frequent
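Note that the three .loc assignments above only work in this order: the == 1 rule must run last so that it overwrites the 'infrequent' label already assigned by the <= 10 rule. A minimal sketch of the overwrite with a toy two-row frame:

```python
import pandas as pd

data = pd.DataFrame({'count': [1, 7]})

data.loc[data['count'] > 10, 'frequency'] = 'frequent'
# This line labels the count of 1 'infrequent'...
data.loc[data['count'] <= 10, 'frequency'] = 'infrequent'
# ...and this line, run last, overwrites it with 'unique'
data.loc[data['count'] == 1, 'frequency'] = 'unique'
print(data)
```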
avwinter
  • All these answers have been great! I gave the solution I used when I posted the question. However my method was really basic, and all these other methods are really useful to know – avwinter Jan 02 '19 at 16:49
  • In that case, you can accept your own answer, if you prefer it to all the others posted here. – cs95 Jan 02 '19 at 16:52
  • thanks for the tip; I don't prefer it over any of the others, but may as well – avwinter Jan 02 '19 at 16:55