
The table below counts unique words in a text (the German text of Hamlet, in this case).

Using pandas, I would like to add a 'frequency' column that contains one of three labels:

  • If the value in the 'count' column is <= 10, the frequency is 'infrequent'.

  • If the value in the 'count' column is > 10, the frequency is 'frequent'.

  • If the value in the 'count' column is 1, the frequency is 'unique'.

I am new to pandas, so I initially thought I would have to use a for loop with if/else. Of course, that didn't work for me, and after reading around I see you can just use .loc[] instead. It's much cleaner.

I'll put the answer below in case anyone else needs this set out really clearly. Here's the table I'm working with before:

      count                 word  length
0     67223                            0
1         7               deinen       6
2         1          überwachsen      11
3         3                 them       4
4         2            fortunens       9
5         1              flammen       7
6         1    ersäuentsezlichen      17
7         2              alleino       7
8         1             empfehle       8
9         1  beschulöffentlicher      19
10        1         unterthänige      12
11        1                   pr       2
12        1       zurükzutreiben      14
13       38                   wo       2
14        1          schadhaften      11
15        1               ddiese       6
16        1         zurükhaltend      12
17        1                 laim       4
18        1               agents       6
avwinter

4 Answers


This is a fantastic use case for pd.cut:

import numpy as np
import pandas as pd

pd.cut(df['count'],
       bins=[-np.inf, 1, 10, np.inf],
       labels=['unique', 'infrequent', 'frequent'])

0       frequent
1     infrequent
2         unique
3     infrequent
4     infrequent
5         unique
6         unique
7     infrequent
8         unique
9         unique
10        unique
11        unique
12        unique
13      frequent
14        unique
15        unique
16        unique
17        unique
18        unique
Name: count, dtype: category
Categories (3, object): [unique < infrequent < frequent]

The disadvantage of np.select in the other answer is that it evaluates all conditions before selecting, so it will not scale as well as the number of conditions grows.
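To attach the pd.cut result as the column the question asks for, a small self-contained sketch (the counts here are made-up stand-ins for the question's data):

```python
import numpy as np
import pandas as pd

# Toy counts standing in for the question's data
df = pd.DataFrame({'count': [67223, 7, 1, 3]})

# Bins are right-closed by default: (-inf, 1], (1, 10], (10, inf]
df['frequency'] = pd.cut(df['count'],
                         bins=[-np.inf, 1, 10, np.inf],
                         labels=['unique', 'infrequent', 'frequent'])
print(df)
```

Because the labels are passed in increasing order of the bins, the resulting column is an ordered categorical, which also sorts sensibly.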

cs95

For multiple conditions, consider using np.select:

import numpy as np

conditions = [data['count'] == 1, data['count'] > 10, data['count'] <= 10]
choices = ['unique', 'frequent', 'infrequent']

data['frequency'] = np.select(conditions, choices)

The order of your conditions is important: np.select takes the first condition that matches, so you do not want data['count'] <= 10 to catch a count of 1 before the 'unique' condition is tested.

You may also wish to consider pd.cut or np.digitize, see also How to map numeric data into categories / bins in Pandas dataframe.
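The np.digitize route mentioned above can be sketched like this (a minimal example with made-up counts; the bin edges mirror the thresholds in the question):

```python
import numpy as np
import pandas as pd

# Hypothetical sample counts mirroring the question's data
data = pd.DataFrame({'count': [67223, 7, 1, 3, 38]})

# np.digitize returns the index of the bin each value falls into:
# with bins [2, 11], a 1 maps to 0, 2..10 map to 1, and > 10 maps to 2
labels = np.array(['unique', 'infrequent', 'frequent'])
data['frequency'] = labels[np.digitize(data['count'], bins=[2, 11])]

print(data)
```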

jpp
  • Excellent as always, can you guide me to a good web link for numpy please, not good there. thanks – anky Dec 29 '18 at 14:17
  • The `data['count'] == 1` should be before `data['count'] <= 10`, otherwise `count == 1` is also classified as "infrequent" (when it should be "unique"). The order of the conditions matter. – cs95 Dec 29 '18 at 14:20
  • I am not worried about that since I am not the OP. :) I have been a keen follower of both of you @coldspeed if you and jpp can share some good numpy links , that would be very helpful. – anky Dec 29 '18 at 14:30
  • @anky_91 [Here is a hopefully instructive starting point](https://stackoverflow.com/search?tab=votes&q=user%3a4909087%20%5bnumpy%5d%20is%3aa) :) As for how/why I learned, it was mostly out of a requirement to speed up existing pandas code (doing the same thing with numpy is usually faster than pandas because of less overhead). Docs, and other, older answers mainly. – cs95 Dec 29 '18 at 14:33
  • @coldspeed, Good point, thanks. We can just change the order of the conditions. – jpp Dec 29 '18 at 15:11

Another alternative is to use DataFrame.apply:

def frequer(wordcnt):
    if wordcnt == 1:
        return 'unique'
    elif wordcnt > 10:
        return 'frequent'
    else:
        return 'infrequent'

df['freq'] = df.apply(lambda x: frequer(x['count']), axis=1)

Output: the original frame with an added 'freq' column (screenshot not recoverable).
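Since frequer only needs the count value, the same result can be had without the row-wise axis=1 apply by applying over the Series directly. A small self-contained variation (the sample counts here are taken from the question's table):

```python
import pandas as pd

def frequer(wordcnt):
    if wordcnt == 1:
        return 'unique'
    elif wordcnt > 10:
        return 'frequent'
    else:
        return 'infrequent'

# Sample counts from the question's table
df = pd.DataFrame({'count': [67223, 7, 1, 3]})

# Series.apply calls frequer once per value; no row iteration needed
df['freq'] = df['count'].apply(frequer)
print(df)
```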

HakunaMaData

After (please ignore the first row, though, which just gives the overall number of unique words in the text):

data.loc[data["count"] > 10,  "frequency"] = "frequent"
data.loc[data["count"] <= 10, "frequency"] = "infrequent"
data.loc[data["count"] == 1,  "frequency"] = "unique"

result:

      count                 word  length   frequency
0     67223                            0    frequent
1         7               deinen       6  infrequent
2         1          überwachsen      11      unique
3         3                 them       4  infrequent
4         2            fortunens       9  infrequent
5         1              flammen       7      unique
6         1    ersäuentsezlichen      17      unique
7         2              alleino       7  infrequent
8         1             empfehle       8      unique
9         1  beschulöffentlicher      19      unique
10        1         unterthänige      12      unique
11        1                   pr       2      unique
12        1       zurükzutreiben      14      unique
13       38                   wo       2    frequent
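Note that the three .loc assignments above only work in this order: the == 1 rule must run last so that it overwrites the 'infrequent' label already assigned by the <= 10 rule. A minimal sketch of the overwrite with a toy two-row frame:

```python
import pandas as pd

data = pd.DataFrame({'count': [1, 7]})

data.loc[data['count'] > 10, 'frequency'] = 'frequent'
# This line labels the count of 1 'infrequent'...
data.loc[data['count'] <= 10, 'frequency'] = 'infrequent'
# ...and this line, run last, overwrites it with 'unique'
data.loc[data['count'] == 1, 'frequency'] = 'unique'
print(data)
```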
avwinter
  • All these answers have been great! I gave the solution I used when I posted the question. However my method was really basic, and all these other methods are really useful to know – avwinter Jan 02 '19 at 16:49
  • In that case, you can accept your own answer, if you prefer it to all the others posted here. – cs95 Jan 02 '19 at 16:52
  • thanks for the tip; I don't prefer it over any of the others, but may as well – avwinter Jan 02 '19 at 16:55