
I need to calculate the majority vote for an item in R and I don't have a clue how to approach this.

I have a data frame with items and assigned categories. What I need is the category that was assigned the most often. How do I do this?

Data frame:

item   category
1      2
1      3
1      2
1      2
2      2
2      3
2      1
2      1

Result should be:

item   majority_vote
1      2
2      1
nantoki
  • take a look at the `table` function and the `plyr` package. However, this is a very common data manipulation and you would likely benefit from reading any of the excellent R tutorials on the "split-apply-combine" strategy of data processing. – Justin Jun 19 '13 at 21:35
  • Apologies, I'm away from my pc with R on so can't provide code. I think you're after the mode for an item. In combination with @Justin's answer it should give you what you need. – Steph Locke Jun 19 '13 at 21:37
  • Thanks, I'll look at the things you suggested and, of course, at all the other strategies suggested. I'm impressed, I didn't expect that there would be that many ways of approaching this. – nantoki Jun 20 '13 at 07:54

4 Answers


You could use two things here. First, this is how you get the most frequent item in a vector:

> v = c(1,1,1,2,2)
> names(which.max(table(v)))
[1] "1"

This is a character value, but we can easily call as.numeric on it if necessary.
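The two steps above can be wrapped in a small helper. This is only a sketch: the name `most_frequent` is made up for illustration, and it assumes the values are numeric so that the `as.numeric` conversion is safe. It also shows what happens on a tie: `table` orders the distinct values and `which.max` returns the first maximum, so the smallest tied value wins.

```r
# Hypothetical helper (not part of the original answer):
# return the most frequent value of a numeric vector.
most_frequent <- function(v) {
  counts <- table(v)                  # frequency of each distinct value
  winner <- names(which.max(counts))  # label of the first maximum, as character
  as.numeric(winner)                  # assumes the values are numeric
}

most_frequent(c(1, 1, 1, 2, 2))  # 1 occurs three times, so this returns 1
most_frequent(c(3, 3, 1, 1))     # tie: the first maximum in table order is 1
```

If ties need to be reported rather than silently resolved, the counts have to be inspected directly instead of relying on `which.max`.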

Once we know how to do that, we can use the grouping functionality of the data.table package to perform a per-item evaluation of what its most frequent category is. Here is the code for your example above:

> library(data.table)
> dt = data.table(item=c(1,1,1,1,2,2,2,2), category=c(2,3,2,2,2,3,1,1))
> dt
   item category
1:    1        2
2:    1        3
3:    1        2
4:    1        2
5:    2        2
6:    2        3
7:    2        1
8:    2        1
> dt[,as.numeric(names(which.max(table(category)))),by=item]
   item V1
1:    1  2
2:    2  1

The new V1 column contains the numeric version of the most frequent category for each item. If you want to give it a proper name, the syntax is a little uglier:

> dt[,list(mostFreqCat=as.numeric(names(which.max(table(category))))),by=item]
   item mostFreqCat
1:    1           2
2:    2           1
asieira
  • I'd avoid `table` and would do something like this instead: `dt[, .SD[, .N, by = category][order(-N)][1], by = item]` – eddi Jun 19 '13 at 22:00

One liner (using plyr):

library(plyr)
ddply(dt, .(item), function(x) which.max(tabulate(x$category)))
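One caveat about the one-liner: `tabulate` counts positive integers into bins, so `which.max` returns a bin index. That index equals the category only when the categories are themselves positive integers, as in the question. The sketch below uses made-up label data to show how the index can be mapped back to a label when categories are not integers.

```r
# Integer categories: the bin index is the category itself.
which.max(tabulate(c(2, 3, 2, 2)))  # bins 1..3 hold counts 0, 3, 1, so this is 2

# Label categories (made-up data): go through a factor and map back.
x <- c("b", "c", "b")
f <- factor(x)
counts <- tabulate(as.integer(f), nbins = nlevels(f))  # counts per factor level
top_label <- levels(f)[which.max(counts)]              # index -> label
top_label
```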
topchef
 # names() extracts the winning category; without it the result is the winning count
 tdat <- tapply(dat$category, dat$item, function(vec) names(sort(table(vec), 
                                                  decreasing=TRUE))[1] )
 data.frame(item=rownames(tdat), plurality_vote=tdat)

  item plurality_vote
1    1              2
2    2              1

A more complex function would be needed to distinguish a plurality (possibly with ties) from a true majority.
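A minimal sketch of what such a function might look like follows. The name `vote_summary` and its output columns are hypothetical, chosen only for illustration, and it assumes there are no missing values in the category column. A winner is reported only when exactly one category has the top count, and it is flagged as a majority only when that count exceeds half of the votes.

```r
# Hypothetical helper: summarise the vote for one item.
vote_summary <- function(vec) {
  counts  <- table(vec)                     # votes per category
  top     <- max(counts)                    # highest vote count
  winners <- names(counts)[counts == top]   # every category reaching that count
  data.frame(
    winner   = if (length(winners) == 1) winners else NA_character_,
    tied     = length(winners) > 1,
    majority = length(winners) == 1 && top > length(vec) / 2,
    stringsAsFactors = FALSE
  )
}

# Data from the question.
dat <- data.frame(item     = c(1, 1, 1, 1, 2, 2, 2, 2),
                  category = c(2, 3, 2, 2, 2, 3, 1, 1))

# Apply per item and stack the one-row results; row names are the item ids.
res <- do.call(rbind, lapply(split(dat$category, dat$item), vote_summary))
res
```

On the question's data, item 1 has a true majority (category 2 has 3 of 4 votes), while item 2 has only a plurality (category 1 has 2 of 4 votes).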

IRTFM

If you have a function to calculate the mode, as in package prettyR, you can use aggregate:

require(prettyR)

aggregate(d$category, by=list(item=d$item), FUN=Mode)
#  item x
#1    1 2
#2    2 1
Ferdinand.kraft