I cannot figure this out. I have a data frame:

id=c(1,2,3,4,2,6,1,1,6,5,4,2)
per=c(0.1,0.9,0.6,0.5,0.8,0.9,0.2,0.3,0.7,0.5,0.4,0.3)
df=data.frame(id=id,per=per)

I want to divide the "per" column into three classes: say, between 0 and 0.3 (assign a 3), between 0.3 and 0.7 (assign a 2), and between 0.7 and 1 (assign a 1).

My idea is to assign each unique id to the class that occurs most often among its rows. For example, if most of the "per" values for id = 1 fell in the 0.7-1 range, then id = 1 would be assigned class 1. For the data above, the result would look like:

 id class
  1     3
  2     1
  3     2
  4     2
  5     2
  6     1

I found this

R- selecting a row based on characteristics of another column in that row

but I need the previous step, i.e., the classification, to reach that point.

Thank you!

Andres
  • Your definition is not clear for edge cases: should 0.3 be assigned a 3 or a 2? Similarly, is 0.7 a 2 or a 1? – Ricky Aug 06 '15 at 08:17
  • You are right, I apologize. It should be: 1: 0 <= x < 0.3, 2: 0.3 <= x < 0.7, 3: 0.7 <= x < 1 – Andres Aug 06 '15 at 08:19
  • Another option is `c(3, 2, 1)[findInterval(per, c(0, 0.3, 0.7, 1))]` – akrun Aug 06 '15 at 08:24
  • One more: if there is a tie (e.g. there is one each of class 2 and 3 for id 6), which class should be assigned to the id? – Ricky Aug 06 '15 at 08:29
  • Yep, forgot that :) If there is a tie, it goes to the lower class. In your example, it goes to class 2. I guess in the future I could implement looking at the "per" itself and make a decision based on that, but for now, it goes to the lower class. Thanks! – Andres Aug 06 '15 at 08:36
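The `findInterval` suggestion in the comments can be checked against `cut` with a short sketch. It assumes the class numbering from the question (the lowest range gets a 3) and the left-closed edges clarified in the comments; the variable names `breaks`, `by_interval` and `by_cut` are made up for illustration.

```r
per <- c(0.1, 0.9, 0.6, 0.5, 0.8, 0.9, 0.2, 0.3, 0.7, 0.5, 0.4, 0.3)
breaks <- c(0, 0.3, 0.7, 1)

# findInterval() returns the index i of the left-closed interval
# [breaks[i], breaks[i + 1]) that each value falls into.
by_interval <- c(3, 2, 1)[findInterval(per, breaks)]

# cut() with right = FALSE uses the same left-closed intervals,
# but returns a factor instead of a numeric vector.
by_cut <- cut(per, breaks = breaks, labels = c(3, 2, 1), right = FALSE)

# Convert the factor labels back to numbers before comparing.
same <- all(by_interval == as.numeric(as.character(by_cut)))
print(same)
```

Both approaches leave a value of exactly 1 unclassified (`NA`), which matches the `x < 1` upper bound given in the comments; if 1 should be included, the breaks would need adjusting.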

2 Answers

You can easily achieve this using the cut function in R:

# specify cut, and labels
class <- cut(per, breaks = c(0, 0.3, 0.7, 1), labels = c(3, 2, 1))

#cbind with original data frame
df_new <- cbind(df, class)

#view
df_new

#     id  per   class
# 1   1   0.1     3
# 2   2   0.9     1
# 3   3   0.6     2

Hope this helps!

UPDATE:

# use dplyr package to summarise
library(dplyr)
(df_stats <- df_new %>% group_by(id,class) %>% summarise(count=n()))

For a given id, the higher the count, the higher the likelihood that the id belongs to the corresponding class.
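To go from these counts to a single class per id, one possibility is the base R sketch below. It is not part of the original answer: it adds `right = FALSE` to match the edge rule from the question's comments, breaks ties toward the lower class number (also taken from the comments), and uses the illustrative names `cls`, `counts` and `majority`.

```r
# Rebuild the example data and the per-row classes.
id  <- c(1, 2, 3, 4, 2, 6, 1, 1, 6, 5, 4, 2)
per <- c(0.1, 0.9, 0.6, 0.5, 0.8, 0.9, 0.2, 0.3, 0.7, 0.5, 0.4, 0.3)
cls <- cut(per, breaks = c(0, 0.3, 0.7, 1), labels = c(3, 2, 1), right = FALSE)

# Count rows per (id, class) pair; this mirrors the group_by/summarise step.
counts <- as.data.frame(table(id = id, class = cls), stringsAsFactors = FALSE)

# Sort by id, then highest count first, then lower class number first (tie rule).
counts <- counts[order(as.numeric(counts$id), -counts$Freq, as.numeric(counts$class)), ]

# Keep the first (winning) row of each id.
majority <- counts[!duplicated(counts$id), c("id", "class")]
print(majority)
```

Here `id` and `class` come back as character columns because they pass through `table()`; wrap them in `as.numeric()` if numeric columns are needed.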

Deolu A
  • Thanks! Yes, this helps to add the class, but now, and sorry if this is obvious, how do I know which class id = 1, for example, corresponds to (in the example here, id = 1 is a class 3, since it has more "3" than "2" or "1")? – Andres Aug 06 '15 at 08:09
  • If I understand you correctly, you want an `id` that will signify which majority class it contains? If I'm right, I've modified my answer in a way that should help. – Deolu A Aug 06 '15 at 08:34

First assign the classes

cl <- cut(per, breaks = c(0, 0.3, 0.7, 1), labels = c(3, 2, 1), right=FALSE)

The parameter right=FALSE makes the intervals closed on the left, which handles the edge cases as you specified in the comments.

Then count how many times each class occurs for each id

chk <- table(id, cl)

Result is

> chk
   cl
id  3 2 1
  1 2 1 0
  2 0 1 2
  3 0 1 0
  4 0 2 0
  5 0 1 0
  6 0 0 2

Then find the column name with the highest count in each row. Ties (two classes with the same count for an id) are resolved by picking the last tied column, which in this case is the lower class number

output <- apply(chk, 1, function(x) names(rev(which(x==max(x))))[1])

Result is

> output
  1   2   3   4   5   6 
"3" "1" "2" "2" "2" "1" 
Ricky
  • Sorry to bother you again, but, how do I keep track of the ids? My original ids do not go from 1 to n, but they are large numbers, and "output" only has one column, the "class". Sorry, and thanks! – Andres Aug 06 '15 at 09:27
  • In `output` displayed above, the first row (i.e. 1, 2, 3, 4 etc) contains the actual values in `id`, not sequence numbers. If you replace the `id` values (e.g. with "a", "b", "c" etc) you should still get the right label. If you want to store them in a variable, simply use `names(output)` or something like `output.df <- data.frame(id=names(output), class=output)` – Ricky Aug 06 '15 at 09:33
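As a sketch of the point in the last comment, the same steps can be run on ids that are not 1 to n. The id values below are invented stand-ins for the asker's real ids, and converting the names back to numbers is an assumption about the desired output format.

```r
# Hypothetical large, non-sequential ids standing in for the real ones.
id  <- c(90210, 10001, 55555, 90210, 10001, 90210)
per <- c(0.1, 0.9, 0.6, 0.2, 0.8, 0.5)

# Same steps as in the answer: classify, tabulate, pick the majority class.
cl  <- cut(per, breaks = c(0, 0.3, 0.7, 1), labels = c(3, 2, 1), right = FALSE)
chk <- table(id, cl)
output <- apply(chk, 1, function(x) names(rev(which(x == max(x))))[1])

# The names of `output` are the original id values (as strings),
# so they can be turned back into a regular id column.
output.df <- data.frame(id = as.numeric(names(output)),
                        class = as.integer(output),
                        row.names = NULL)
print(output.df)
```

Note that `table()` sorts the ids, so the rows of `output.df` follow sorted id order rather than order of first appearance. Very large ids may be turned into scientific notation when converted to names, in which case storing the ids as character strings up front is safer.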