I have a data frame and want to calculate the majority vote within each level of a factor, e.g.

 item   category
 1      2
 1      3
 1      2
 1      2
 2      2
 2      3
 2      1
 2      1

The output should be

item   majority_vote
1      2
2      NA

You may recognize the example data from here, but I don't want the mode; I want the actual majority vote (meaning more than half of the people selected that option). Hence item 2 should have no majority.

`table()` doesn't seem to help, because `which.max()` only gives me the modal value. I need to know three things: the number of votes I have, the name of the most popular option, and the number of times someone voted for that option. I can get the first two with `tapply(all_results_filtered$q1, all_results_filtered$X_row_id, function(x) length(x))` and `tapply(all_results_filtered$q1, all_results_filtered$X_row_id, function(x) as.numeric(names(which.max(table(x)))))`, but how can I get the number of votes for `which.max(table(x))`?

Or... is there some simpler way that I'm missing? Thanks!

jrubins
  • `aggregate(category ~ item, df, function(x){y <- x[prop.table(table(x)) > 0.5]; ifelse(any(is.null(y)), NA, unique(y))})`, but there may be a simpler option – alistaire Oct 17 '16 at 22:41
  • Ah! Stealing Psidom's indexing from below, a reasonably nice base version: `aggregate(category ~ item, df, function(x){x[prop.table(table(x)) > 0.5][1]})` – alistaire Oct 17 '16 at 22:58
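For reference, the count the question asks about (the number of votes behind `which.max(table(x))`) is just that entry of the table, so a strict-majority check fits in a small helper. This is a minimal base R sketch, not code from the thread: `majority_vote` is a hypothetical helper name, and `df` is rebuilt from the example data in the question. It assumes there are no missing votes, since `table()` drops `NA` while `length()` counts it.

```r
# Example data from the question
df <- data.frame(
  item     = c(1, 1, 1, 1, 2, 2, 2, 2),
  category = c(2, 3, 2, 2, 2, 3, 1, 1)
)

# Hypothetical helper: return the option with more than half of the votes,
# or NA when no option reaches a strict majority.
majority_vote <- function(x) {
  counts <- table(x)             # votes per option
  top <- which.max(counts)       # position of the modal option
  if (counts[[top]] > length(x) / 2) names(counts)[top] else NA_character_
}

votes <- tapply(df$category, df$item, majority_vote)

result <- data.frame(
  item          = as.numeric(names(votes)),
  majority_vote = as.numeric(votes)
)
```

Looking the winner up by name in the table, rather than indexing `x` with a logical vector built from the table, avoids depending on the order of the rows within each item.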

1 Answer

Here is a dplyr option:

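library(dplyr)

df %>%
  group_by(item, category) %>%
  mutate(votes = n()) %>%    # votes for each category within an item
  group_by(item) %>%
  # keep a category only if it has more than half of the item's rows; NA otherwise
  summarise(majority_vote = category[votes > n() / 2][1])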

# A tibble: 2 x 2
#   item majority_vote
#  <int>         <int>
#1     1             2
#2     2            NA
Psidom
  • You could simplify: `df %>% count(item, category) %>% summarise(majority_vote = category[n > sum(n / 2)][1])` – alistaire Oct 17 '16 at 22:56
  • `count` and `summarise` always peel off the last grouping variable, so as long as the parameters of `group_by`/`count` are `item, category` instead of the other way around, it will already be grouped by `item` for the `summarise`. You do have to add `sum`, though, because `count` summarizes instead of mutates. – alistaire Oct 17 '16 at 23:05
  • Actually, @alistaire's comment gives the better answer because it guarantees that votes > n/2. Your comment gives the answer of 1 when the case is votes = 3,0,1,1. alistaire's correctly gives NA. – jrubins Oct 17 '16 at 23:06
  • @alistaire Interesting, didn't know `count` returns a grouped data frame before. Excellent! – Psidom Oct 17 '16 at 23:06
  • Possible data.table translation - `dat[, .N, by=.(item,category)][, if(any(N/sum(N) > 0.5)) .(category=category[which.max(N)]) else NA_integer_, by=item]` – thelatemail Oct 17 '16 at 23:12
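The `count()` shortcut discussed in the comments depends on the result still being grouped by `item` when it reaches `summarise()`. A sketch that does not rely on that, assuming `df` holds the example data from the question, adds the `group_by(item)` step explicitly:

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  item     = c(1, 1, 1, 1, 2, 2, 2, 2),
  category = c(2, 3, 2, 2, 2, 3, 1, 1)
)

result <- df %>%
  count(item, category) %>%   # one row per (item, category), tally in column n
  group_by(item) %>%          # explicit, whether or not count() left a grouping
  summarise(majority_vote = category[n > sum(n) / 2][1])
```

Indexing with `[1]` turns an empty selection into `NA`, which is what gives item 2 no majority.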