2

I need to create a column in a dataset that reports the most recent row-wise modal text value in a selection of columns (ignoring NAs).

Background: I have a dataset in which up to 4 coders rated participant transcripts (one participant per row). Occasionally a minority of coders either disagree or select the wrong code for a participant. So I need to reproducibly select the modal code across coders for each participant (i.e., for each row) and, when there is a tie, select the most recent (later) modal code, because later codings are more likely to be correct.

Here's a fake example of the dataset with four coders' codes (Essay or Chat) for 3 participants (one per row).

> fakeData = data.frame(id = 1:3,
+                 Condition = c("Essay", "Chat", "Chat"),
+                 FirstCoder = c("NA","Essay","Essay"),
+                 SecondCoder = c("NA","Chat","Essay"),
+                 ThirdCoder = c("Essay","Chat","Chat"),
+                 FourthCoder = c("Essay","NA","Chat"))
> fakeData
  id Condition FirstCoder SecondCoder ThirdCoder FourthCoder
1  1     Essay         NA          NA      Essay       Essay
2  2      Chat      Essay        Chat       Chat          NA
3  3      Chat      Essay       Essay       Chat        Chat

Regarding recency: The "FirstCoder" coded first, "SecondCoder" coded next, then the "ThirdCoder" submitted their code, and "FourthCoder" was the last (and most recent) coder to submit a response.

Here are some methods I've tried from other forums—notice how I need to ignore the "Condition" column:

> fakeData$ModalCode1 <- apply(fakeData,1,function(x) names(which.max(table(c("FirstCoder","SecondCoder", "ThirdCoder", "FourthCoder")))))
> fakeData$ModalCode2 <- apply(select(fakeData,ends_with("Coder")), 1, Mode)

The correct result would be this column (created manually)

> fakeData$MostRecentModalCode <- c("Essay", "Chat", "Chat")

You can see that none of my attempts are getting the correct result (i.e., "MostRecentModalCode").

> fakeData
  id Condition FirstCoder SecondCoder ThirdCoder FourthCoder ModalCode1 ModalCode2 MostRecentModalCode
1  1     Essay         NA          NA      Essay       Essay FirstCoder         NA               Essay
2  2      Chat      Essay        Chat       Chat          NA FirstCoder       Chat                Chat
3  3      Chat      Essay       Essay       Chat        Chat FirstCoder      Essay                Chat

As you can see the final (correct) column ignores NAs and breaks modal ties with the more recent coders' responses (unlike the traditional Mode function).

Surely there's a function for this, but I am just failing to find or correctly implement it.

Advice and solutions welcome! (If I have to create a custom function, that's fine—albeit surprising.)

Nick Byrd

5 Answers

4

We can use the Mode function from here

> Mode <- function(x) {
+   ux <- unique(x)
+   ux[which.max(tabulate(match(x, ux)))]
+ }
> 
> apply(fakeData[-1], 1, Mode)
[1] "Essay" "Chat"  "Chat" 
akrun
  • The dataset I'm working with has over 100 columns and 1825 rows. I get the following error when applying that function to a new column: "Error in set(x, j = name, value = value) : Supplied 1824 items to be assigned to 1825 items of column '[the new column]'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code." – Nick Byrd Apr 07 '23 at 18:29
  • When I use select() to get the right columns, it *sort of* works, but it's not ignoring NAs. (In other words, if NA is the most frequent value, it thinks NA is the mode). – Nick Byrd Apr 07 '23 at 18:33
  • @NickByrd is your data `data.frame` or `data.table`? I assumed it is data.frame, thus I used `fakeData[-1]`. If it is data.table, it can be `fakeData[, -1, with = FALSE]` – akrun Apr 07 '23 at 18:34
  • @NickByrd The second comment doesn't seem to be true for your fake data. If I replace the first-row value in one of the columns with NA to make it the most frequent value, it still works for me: `fakeData$ThirdCoder[1] <- NA; apply(fakeData[-1], 1, Mode)` returns `[1] "Essay" "Chat" "Chat"` – akrun Apr 07 '23 at 18:37
  • 1
    @NickByrd if you are still getting NA as output in your original data, it implies that your NA may be the character string `"NA"` instead of a real `NA`. So you may need to replace `"NA"` with `NA` before applying the code – akrun Apr 07 '23 at 18:39
  • I replaced the "NA" with actual NA, but it's still reporting NA as the mode when most row cells are NA in my data.table. To do this in each instance needed in my real data, I seem to need to use select(), the equivalent of `select(fakeData, ends_with("Coder"))`, in order to select the right column(s) for each mode from the over 100 columns. I can't get the function above to work with select(). It only works with your `fakeData[, -1, with = FALSE]` suggestion (which doesn't transfer to my real data). – Nick Byrd Apr 07 '23 at 19:10
  • 1
    @NickByrd For data.table, you can use the syntax, `fakeData[, apply(.SD, 1, Mode), .SDcols = patterns("Coder$")]` – akrun Apr 08 '23 at 07:06
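
To illustrate the point in the comments about `"NA"` strings versus real `NA`, here is a minimal base-R sketch that converts the literal strings to missing values in the coder columns only. It rebuilds the question's `fakeData`; the name `coderCols` is an assumption introduced for illustration, and the same idea should carry over to other column selections.

```r
# Rebuild the question's example data (note the literal "NA" strings).
fakeData <- data.frame(
  id = 1:3,
  Condition   = c("Essay", "Chat", "Chat"),
  FirstCoder  = c("NA", "Essay", "Essay"),
  SecondCoder = c("NA", "Chat", "Essay"),
  ThirdCoder  = c("Essay", "Chat", "Chat"),
  FourthCoder = c("Essay", "NA", "Chat"),
  stringsAsFactors = FALSE
)

# coderCols is a hypothetical name: every column whose name ends in "Coder".
coderCols <- grep("Coder$", names(fakeData), value = TRUE)

# Replace the string "NA" with a real missing value, column by column.
# %in% never returns NA, so existing real NAs are left untouched.
fakeData[coderCols] <- lapply(fakeData[coderCols], function(col) {
  col[col %in% "NA"] <- NA
  col
})

# Real NAs now print as <NA> and are detected by is.na().
colSums(is.na(fakeData[coderCols]))
```

After this step, functions that drop missing values with `is.na()` or `na.rm = TRUE` treat those cells as missing.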
1

@akrun's answer pointed me to another post that had a custom Mode function buried in the answers that fit my needs. I've renamed it ModeC, adapted from Mode in @DanHoughton's answer (https://stackoverflow.com/a/53290748/1701844).

ModeC <- function(x) {
  if ( length(x) <= 2 ) return(x[1])
  if ( anyNA(x) ) x = x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

For reasons I do not understand, it fails to ignore NAs on the fakeData (whether it's a data.table or a data.frame, and even when the NAs are real NAs rather than "NA" strings), but it correctly ignores NAs when determining the mode in my actual data. So I am posting it here in case it works for others.
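
Note that `which.max` returns the first maximum, so on a tie `ModeC` picks the earliest modal value rather than the most recent one. Below is a minimal sketch of a variant that breaks ties toward the later coder. `MostRecentMode` is a hypothetical helper written for this example, not a function from any package. It assumes the coder columns are passed in chronological order and, as an extra assumption, treats the literal string `"NA"` as missing.

```r
# Rebuild the question's example data (note the literal "NA" strings).
fakeData <- data.frame(
  id = 1:3,
  Condition   = c("Essay", "Chat", "Chat"),
  FirstCoder  = c("NA", "Essay", "Essay"),
  SecondCoder = c("NA", "Chat", "Essay"),
  ThirdCoder  = c("Essay", "Chat", "Chat"),
  FourthCoder = c("Essay", "NA", "Chat"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: x holds one row's codes, ordered earliest to latest.
MostRecentMode <- function(x) {
  # Drop real NAs and (by assumption) literal "NA" strings.
  x <- x[!is.na(x) & x != "NA"]
  if (length(x) == 0) return(NA_character_)
  counts <- table(x)
  # All values that share the highest frequency.
  modes <- names(counts)[counts == max(counts)]
  # Among the tied values, return the one that occurs latest in x.
  unname(x[max(which(x %in% modes))])
}

# Only the coder columns, in the order the coders submitted their codes.
coderCols <- c("FirstCoder", "SecondCoder", "ThirdCoder", "FourthCoder")
fakeData$MostRecentModalCode <- apply(fakeData[coderCols], 1, MostRecentMode)
fakeData$MostRecentModalCode
# [1] "Essay" "Chat"  "Chat"
```

Row 3 is the tie case (two "Essay", two "Chat"); the helper returns "Chat" because it is the later of the two tied codes.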

Nick Byrd
1

If you work with data.table, you can try the code below

library(data.table)

melt(setDT(fakeData),
  id.vars = "id", na.rm = TRUE
)[
  , .N,
  .(id, value)
][
  , .(value = value[which.max(N)]),
  id
]

which gives

   id value
1:  1 Essay
2:  2  Chat
3:  3  Chat
ThomasIsCoding
0

What about:

apply(fakeData[,-1], 1, DescTools::Mode, na.rm=TRUE)

?

Andri Signorell
0

You can use:

apply(fakeData[-1], 1, \(x) names(which(max(table(x))==table(x))))
#[1] "Essay" "Chat"  "Chat" 

This returns all of the most frequent levels when there is more than one.
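
As a small illustration of that tie behaviour, here is a sketch that applies the same expression to the coder columns only. It assumes the data has real `NA` values rather than `"NA"` strings (which `table()` drops by default); `coderData` and `ties` are names introduced for this example.

```r
# Coder columns from the question, with real NA instead of "NA" strings
# (an assumption made for this example).
coderData <- data.frame(
  FirstCoder  = c(NA, "Essay", "Essay"),
  SecondCoder = c(NA, "Chat", "Essay"),
  ThirdCoder  = c("Essay", "Chat", "Chat"),
  FourthCoder = c("Essay", NA, "Chat"),
  stringsAsFactors = FALSE
)

# Same expression as above, written with function(x) so it also runs on R < 4.1.
ties <- apply(coderData, 1, function(x) names(which(max(table(x)) == table(x))))

# Rows 1 and 2 have a single mode; row 3 is a two-way tie, so the
# results differ in length and apply() returns a list, not a vector.
ties[[3]]
# [1] "Chat"  "Essay"
```

Because tied rows yield more than one value, the result needs a further tie-breaking step before it can be stored as a single column.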

GKi