0

Let's say I'm tracking the answers to 100 problems on a quiz taken by 1000 students. The majority of students give one of four or five wrong answers, but a larger number also give wildly incorrect answers that very few other students give. How can I turn all those wildly wrong answers into a new value of "wildly wrong" while keeping the other ones? For the sake of example, let's say for one question 200 students give one answer, 150 give a second, 160 give a third, and 490 give something that no one else gave. For another, 80 students gave one answer, 50 a second, and 30 a third, but 840 gave something no one else gave. I want to turn the 490 for one question and 840 for the other all into "wildly wrong".

I looked at purr, but I think I'm missing something that could automate given that I want say top 3 answers to remain unchanged but the rest changed.

Shortening the numbers for the sake of example:

a1 <- c("rna", "rna", "dna", "dna", "cell", "cell", "cell", "hair", "nail", "finger", "toe", "scallop", "brow", "mitosis", "my toes is")  
a2 <- c("darwin", "darwin", "darwin", "einstein", "einstein", "einstein", "einstein", "pollack", "newton", "leibniz", "johnson", "no idea", "you", "me", "no one")  
a3 <- c("5.5", "5.5", "5.6", "5.5", "5.4", "5.2", "5.4", "5.6", "2", "3", "1", "-1", "5.5", "-5.5", "72.4")  

df <- data.frame(a1, a2, a3)

Afterwards, I'm trying to get:

> plyr::count(df$a1)
1        cell    3
2         dna    2
3         rna    2
4        wild    8

> plyr::count(df$a2)
1      darwin    3
2     einstein   4
3         wild   8

> plyr::count(df$a3)
1          5.4   2
2          5.5   4
3          5.6   2
4      the rest  7
Sleepy Miles
  • 169
  • 1
  • 1
  • 9

2 Answers2

2

Maybe you unintentionally have your columns class as factors but we can explore that and get the desired output

n <- 1

sapply(df, function(x) {
   temp <- as.character(factor(x, exclude = names(which(table(x) <= n))))
   temp[is.na(temp)] <- "wild"
   table(temp)                     
})

#$a1
#temp
#cell  dna  rna wild 
#   3    2    2    8 

#$a2
#temp
#  darwin einstein     wild 
#       3        4        8 

#$a3
#temp
# 5.4  5.5  5.6 wild 
#   2    4    3    6 
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • 1
    Nice use of `exclude`, though this doesn't quite match OP's description of *"top 3 answers to remain unchanged"*. Worth pointing out the difference between the number-of-responses `n` threshold in this answer vs the top `n` most common answers to keep in mine. – Gregor Thomas Nov 21 '19 at 03:40
  • ohh..yeah. I misunderstood what `n` means in my answer. I guess it can be updated with `exclude = names(sort(table(x), decreasing = TRUE)[-(1:3)])))` but I am not sure how to handle the 2nd column case where there are only 2 values more than 1. – Ronak Shah Nov 21 '19 at 03:48
1

Here's a pretty straightforward loop:

# change any factors to character
factors = sapply(df, is.factor)
df[factors] = lapply(df[factors], as.character)

# replace less common answer with "wild"
n = 3
for(i in seq_along(df)) {
  counts = table(df[[i]])
  if(length(counts) <= n) next
  keepers = names(counts)[rank(-counts) <= 3]
  df[[i]][!df[[i]] %in% keepers] = "wild"
}
lapply(df, table)
# $a1
# 
# cell  dna  rna wild 
#    3    2    2    8 
# 
# $a2
# 
#   darwin einstein     wild 
#        3        4        8 
# 
# $a3
# 
#  5.4  5.5  5.6 wild 
#    2    4    3    6 

Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • Thanks! I got NAs for some reason, but that's easy to solve. Are loops still OK to use in R? I was under the impression that the tidyverse frowns on them, preferring apply() collective. – Sleepy Miles Nov 21 '19 at 02:52
  • 1
    For loops vs apply is all about readability -- most all performance talk is a poorly understood distraction. tidyverse has moved away from apply, preferring `purrr` functions. You could certainly rewrite this using `lapply` or `purrr` functions. I think it's very readable and straightforward as written, so I'll leave any rewriting to you. – Gregor Thomas Nov 21 '19 at 02:56
  • If you want to learn more, I'd suggest reading through the answers to [Is R's apply family more than syntactic sugar?](https://stackoverflow.com/q/2275896/903061). **All** the answers are good, and reading through them all is the best way to get a broad context. – Gregor Thomas Nov 21 '19 at 02:59
  • Thanks! I'll have a look at that answer. I don't really have a preference. I'm more familiar with loops anyway, but have been generally steered away from them. – Sleepy Miles Nov 21 '19 at 03:05
  • Slight hiccup. If there are two answers tied for 3rd place, your code will change both of them, rather than give both of them. I changed the sample data set in order to test that. Otherwise things are working smoothly. – Sleepy Miles Nov 21 '19 at 04:47
  • (a) Not sure I understand - running my code on your updated example still gives the desired the result. Maybe you're counting ranks differently? (b) You can look at `?rank` and choose whatever tie method you want, there are 7 options for the `ties.method` argument. – Gregor Thomas Nov 21 '19 at 05:24
  • (c) However, if you want something "custom", like *if there are two ties for 3rd place, keep them both, but if there are more than 2, change them all*, you may have to do some special handling. It might work to use the (default) `average` tie method and change the threshold to `<= 3.5`, but do note that your `a2` has an 8-way tie for third place, and you seem to want them all changed. Otherwise, you may have to better articulate your requirements and do some custom checking of cases. – Gregor Thomas Nov 21 '19 at 05:27
  • And re:(a), your data modification created a tie for 2nd place, not a tie for 3rd, so that may be that's what is causing my confusion. That said, based on (b) and (c) I bet you can adapt the answer to your needs. – Gregor Thomas Nov 21 '19 at 05:30
  • Thanks for the tip, and I'll look into rank() and figure it out. Much appreciated! – Sleepy Miles Nov 23 '19 at 17:09