1

Hey guys I am working on a very large data set, (200 columns, 750,000 rows). There are a few cells with the value -1 and I want to replace each of those cells based on the mode (highest repeating value) of that specific column and not just one standard thing for all -1's. For example -1 in column A be replaced by mode on column A, -1 in column B replaced by mode of column B.

Is there a way to deal with this without having to make new 200 columns and 750,000 rows using if function?

Thank you for your time.

Regards, InWoords

InWoods
  • 105
  • 1
  • 2
  • 9

1 Answers1

3

We can use the Mode function from here

Mode <- function(x) {
   ux <- unique(x)
   ux[which.max(tabulate(match(x, ux)))]
 }

Then, with mutate_all replace the values of -1 with the 'Mode' of each column

library(dplyr)
df1 %>%
     mutate_all(funs(replace(., .== -1, Mode(.))))

It is not clear whether -1 occurs as the most frequent value in a column. In that exclude that in the mutate_all

df1 %>%
     mutate_all(funs(replace(., .== -1, Mode(.[. != -1]))))

Also, if there are columns without -1, then we can use mutate_if to skip those

df1 %>%
   mutate_if(funs(any(.==-1)),funs(replace(., .== -1, Mode(.[. != -1]))))

data

df1 <- structure(list(V1 = c(1L, 0L, 3L, 2L, 3L, 5L, 0L, 4L, 4L, 0L), 
V2 = c(3L, 1L, 3L, 3L, 1L, 5L, 0L, -1L, 2L, -1L), V3 = c(-1L, 
2L, 4L, -1L, 0L, 3L, -1L, 3L, 3L, -1L), V4 = c(1L, 0L, 1L, 
0L, 5L, 0L, 4L, 3L, 3L, 5L), V5 = c(2L, 1L, 0L, 0L, -1L, 
-1L, 1L, 1L, 0L, 1L)), .Names = c("V1", "V2", "V3", "V4", 
 "V5"), row.names = c(NA, -10L), class = "data.frame")
akrun
  • 874,273
  • 37
  • 540
  • 662