0

I have a data frame that looks like this:

df <- data.frame(Set = c("A","A","A","B","B","B","B"), Values=c(1,1,2,1,1,2,2))

I want to collapse the data frame so I have one row for A and one for B. I want the Values column for those two rows to reflect the most common Values from the whole dataset.

I could do this as described here (How to find the statistical mode?), but notably when there's a tie (two values that each occur once, therefore no "true" mode) it simply takes the first value.

I'd prefer to use my own hierarchy to determine which value is selected in the case of a tie.

pfadenhw
  • 119
  • 6

1 Answers1

0

Create a data frame that defines the hierarchy, and assigns each possibility a numeric score.

hi <- data.frame(Poss = unique(df$Set), Nums =c(105,104))

In this case, A gets a numerical value of 105, B gets a numerical score of 104 (so A would be preferred over B in the case of a tie).

Join the hierarchy to the original data frame.

require(dplyr)
matched <- left_join(df, hi, by = c("Set"="Poss"))

Then, add a frequency column to your original data frame that lists the number of times each unique Set-Value combination occurs.

setDT(matched)[, freq := .N, by = c("Set", "Value")]

Now that those frequencies have been recorded, we only need row of each Set-Value combo, so get rid of the rest.

multiplied <- distinct(matched, Set, Value, .keep_all = TRUE)

Now, multiply frequency by the numeric scores.

multiplied$mult <- multiplied$Nums * multiplied$freq

Lastly, sort by Set first (ascending), then mult (descending), and use distinct() to take the highest numerical score for each Value within each Set.

check <- multiplied[with(multiplied, order(Set, -mult)), ]

final <- distinct(check, Set, .keep_all = TRUE)

This works because multiple instances of B (numerical score = 104) will be added together (3 instances would give B a total score in the mult column of 312) but whenever A and B occur at the same frequency, A will win out (105 > 104, 210 > 208, etc.).

If using different numeric scores than the ones provided here, make sure they are spaced out enough for the dataset at hand. For example, using 2 for A and 1 for B doesn't work because it requires 3 instances of B to trump A, instead of only 2. Likewise, if you anticipate large differences in the frequencies of A and B, use 1005 and 1004, since A will eventually catch up to B with the scores I used above (200 * 104 is less than 199 * 205).

pfadenhw
  • 119
  • 6