I need to change the group label of rows whose group does not have enough points. For example,

+-----+
|c1|c2|
+-----+
|A |1 |
|A |2 |
|B |1 |
|A |2 |
|E |5 |
|E |6 |
|W |1 |
+-----+

Suppose I group on the values in c2, and each group must contain at least 2 points.

c2:
1 : count(c1) = 3
2 : count(c1) = 2
5 : count(c1) = 1
6 : count(c1) = 1

Clearly, groups 5 and 6 have only 1 element each, so I would like to relabel those rows' c2 values to -1.

The desired result is shown below.

+-----+
|c1|c2|
+-----+
|A |1 |
|A |2 |
|B |1 |
|A |2 |
|E |-1|
|E |-1|
|W |1 |
+-----+

This is the code I have written; however, it is not updating the dataframe.

labels = df["c2"].unique()
for l in labels:
    group_size = df[df["c2"]==l].shape[0]
    if group_size<=minPts:
        df[df["c2"]==l]["c2"] = -1
ekad
Bryce Ramgovind
    nice little DataFrame up there! just one little request here, if you could also make it copy-pastable for people, such that we could copy your dataframe directly into the repl and work on your problem, it'll be even better. take a look [here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples), you will find it helpful! thanks ! – stucash Dec 12 '17 at 12:13

1 Answer

You can use value_counts to count each value, keep the values whose count is below 2, and then set the matching rows via a boolean mask built with isin:

s = df['c2'].value_counts()
s = s.index[s < 2]
print (s)
Int64Index([6, 5], dtype='int64')

df.loc[df['c2'].isin(s), 'c2'] = -1
print (df)
  c1  c2
0  A   1
1  A   2
2  B   1
3  A   2
4  E  -1
5  E  -1
6  W   1
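
For anyone who wants to paste this straight into a REPL, here is a minimal self-contained sketch of the same steps. The DataFrame is rebuilt from the table in the question, and the name min_pts is an assumption introduced here for the threshold of 2; it is not part of the original answer.

```python
import pandas as pd

# Rebuild the example frame from the table in the question.
df = pd.DataFrame({
    "c1": ["A", "A", "B", "A", "E", "E", "W"],
    "c2": [1, 2, 1, 2, 5, 6, 1],
})

# Hypothetical name for the minimum group size (2 in the question).
min_pts = 2

# Count how often each c2 value occurs, then keep the values that are too rare.
counts = df["c2"].value_counts()
small = counts.index[counts < min_pts]

# Relabel every row whose c2 value belongs to an undersized group.
df.loc[df["c2"].isin(small), "c2"] = -1

print(df)
```

The single df.loc[mask, column] assignment writes into df itself, which is why the change persists.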

Detail:

print (df['c2'].value_counts())
1    3
2    2
6    1
5    1
Name: c2, dtype: int64

print (df['c2'].isin(s))
0    False
1    False
2    False
3    False
4     True
5     True
6    False
Name: c2, dtype: bool
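
As for why the loop in the question leaves the DataFrame unchanged: df[df["c2"]==l]["c2"] = -1 is chained indexing, so the assignment lands on the temporary copy returned by the first df[...] and never reaches df. Below is a rough sketch of the same loop using one .loc call per label instead. It assumes a threshold named min_pts (a name chosen here) and uses a strict < comparison so that groups of exactly 2 are kept, matching the "at least 2" requirement.

```python
import pandas as pd

# Example frame rebuilt from the table in the question.
df = pd.DataFrame({
    "c1": ["A", "A", "B", "A", "E", "E", "W"],
    "c2": [1, 2, 1, 2, 5, 6, 1],
})

# Hypothetical name for the minimum group size.
min_pts = 2

# unique() is evaluated once, before any relabelling happens.
for label in df["c2"].unique():
    mask = df["c2"] == label          # boolean mask for this group
    if mask.sum() < min_pts:          # group is too small
        df.loc[mask, "c2"] = -1       # single indexing step, writes into df

print(df)
```

The loop works, but the value_counts and isin version above does the same thing without iterating over the labels in Python.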
jezrael