0

I have a multilevel dataset df on my hands with the following organization:

    
ID     Eye   Video_number  Time  Day   measurement1
40001   L         1         1     1        0.60 
40001   L         2         1     1        0.50 
40001   L         3         1     1        0.80 
40001   L         1         2     1        0.60 
40001   L         2         2     1        0.60 
40001   L         3         2     1        0.60 

Goal I am trying to replace cell values of measurements that have a coefficient of variance above 45 with NA, since these values are probably less precise and should be excluded.

The coefficient of variation(sometimes denoted CV) of a distribution is defined as the ratio of the standard deviation to the mean, with $\mu$ and $\sigma$ values obtained from the raw data

  1. I obtained the CV values by Time units (averaging measurement of three videos in one Time unit) with the following function and for loop. I got help from the following threads:

How to correctly use group_by() and summarise() in a For loop in R

Append data frames together in a for loop

# Define function
cv <- function(x){
  sd(na.omit(x))/mean(na.omit(x))*100}


# Variables
vars <- c("measurement1", "measurement2", "measurement3")

# Create a table with all CV values by ID, Eye, Day, and Time
df_cv=data.frame()
for (i in vars){
  df<-df.m2
  df$values<-df[,which(colnames(df.m2)==i)]
  x<-df%>%
    group_by(ID,Eye,Day,Time) %>%
    summarise(Count = n(),
              Mean = mean(values, na.rm = TRUE),
              SD = sd(values, na.rm = TRUE),
              CV = cv(values))%>%
    mutate(Variable=paste(i,"cv",sep="_"))
  df_cv<-rbind(df_cv,x)
  df_cv$CV[is.nan(df_cv$CV)]<-0    # for 0/0 on CV formula giving NaN

}


  1. It resulted in the following table df_cv:
 ID    Eye Day Time Count      Mean         SD        CV         Variable
40001   L   1   1   3       0.56666667  0.057735027 10.1885342  measurement1_cv
40001   L   1   2   3       0.36666667  0.404145188 110.2214150 measurement1_cv
40001   L   1   3   3       0.50000000  0.000000000 0.0000000   measurement1_cv
  1. I reformatted df_cv above to wide format (Variables and CVs across row rather than down a column). This enabled me to merge the CVs with the original df
df_cv<-dcast(df_cv,PIDN+Eye+Day+Time~Variable,value.var = "CV")
df<-merge(df,df_cv,by=c("PIDN","Eye","Day","Time"))
    
ID     Eye   Video_number  Time  Day   measurement1     measurement1_cv
40001   L         1         1     1        0.60           10.1885342
40001   L         2         1     1        0.50           10.1885342
40001   L         3         1     1        0.80           10.1885342
40001   L         1         2     1        0.80           110.2214150 
40001   L         2         2     1        0.30           110.2214150 
40001   L         3         2     1        0.00           110.2214150 
  1. I know want to input NAs into the cells of measurement 1 that have a CV>45. I know how to do this measurement by measurement, but I was wondering if there was a for loop capable of doing this, since I have a lot of variables I am analyzing.
df$measurement1[df$measurement1_cv>45]<-NA
df$measurement2[df$measurement2_cv>45]<-NA
df$measurement3[df$measurement3_cv>45]<-NA

Below are my failed attempts:

for (i in vars) {
  df<-df.m3
  df$i[df$i_cv>45]<-NA
}

Error in `$<-.data.frame`(`*tmp*`, "i", value = logical(0)) : 
  replacement has 0 rows, data has 609



for (i in vars) {
  df<-df.m3
  df$i[df$paste(i,"_cv")>45]<-NA
}

Error in df$paste(i, "_cv") : attempt to apply non-function


Any help is greatly appreciated!

Shivvy
  • 67
  • 2
  • 6
  • `df[,c(2,4,17,23)] <- lapply(df[,c(2,4,17,23)], function(a) replace(a, a > 45, NA))` where the vector of numbers here are the columns you need. It can also be a logical vector, perhaps derived from `startsWith(colnames(df), "measurement")` – r2evans Oct 06 '20 at 18:24
  • 1
    Thank you! However, I need to apply CV formula by Time instance, which is why I used `group_by(ID,Eye,Day,Time)` I tried the following: `lapply(df[,vars], function(cv) replace(cv,cv>45,NA))` and I don't think it recognizes the unit of Time by which to take the means and sd of the videos. It replaces many values with `NA`. – Shivvy Oct 06 '20 at 18:49

1 Answers1

0

Two thoughts.

But first, augmenting your data a little.

df$measurement2 <- 0.9; df$measurement2_cv <- c(rep(46, 3), rep(44, 3))
library(dplyr)
  1. inline function that operates only on one group at a time (i.e., it ignores grouping, so it must be within dplyr::do).

    myfunc <- function(x) {
      CV <- grep("^measurement.*_cv", colnames(x), value = TRUE)
      MEAS <- gsub("_cv$", "", CV)
      CV <- CV[MEAS %in% colnames(x)]
      MEAS <- MEAS[MEAS %in% colnames(x)]
      x[,MEAS] <- Map(function(meas, cv) replace(meas, cv > 45, NA), x[,MEAS], x[,CV])
      x
    }
    df %>%
      group_by(ID, Eye, Day, Time) %>%
      do(myfunc(.))
    # # A tibble: 6 x 9
    # # Groups:   ID, Eye, Day, Time [2]
    #      ID Eye   Video_number  Time   Day measurement1 measurement1_cv measurement2 measurement2_cv
    #   <int> <chr>        <int> <int> <int>        <dbl>           <dbl>        <dbl>           <dbl>
    # 1 40001 L                1     1     1          0.6            10.2         NA                46
    # 2 40001 L                2     1     1          0.5            10.2         NA                46
    # 3 40001 L                3     1     1          0.8            10.2         NA                46
    # 4 40001 L                1     2     1         NA             110.           0.9              44
    # 5 40001 L                2     2     1         NA             110.           0.9              44
    # 6 40001 L                3     2     1         NA             110.           0.9              44
    
  2. Pivot, calculate, then unpivot, using tidyr.

    library(tidyr) # pivot_*
    df %>%
      rename_with(.fn = ~ paste0(., "_val"), .cols = matches("^meas.*[^v]$")) %>%
      rename_with(.fn = ~ gsub("(.*)_(.*)", "\\2_\\1", .), .cols = starts_with("meas")) %>%
      pivot_longer(., matches("meas"), names_sep = "_", names_to = c(".value", "meas")) %>%
      mutate(val = if_else(cv > 45, NA_real_, val)) %>%
      pivot_wider(1:5, names_from = "meas", names_sep = "_", values_from = c("val", "cv"))
    # # A tibble: 6 x 9
    #      ID Eye   Video_number  Time   Day val_measurement1 val_measurement2 cv_measurement1 cv_measurement2
    #   <int> <chr>        <int> <int> <int>            <dbl>            <dbl>           <dbl>           <dbl>
    # 1 40001 L                1     1     1              0.6             NA              10.2              46
    # 2 40001 L                2     1     1              0.5             NA              10.2              46
    # 3 40001 L                3     1     1              0.8             NA              10.2              46
    # 4 40001 L                1     2     1             NA                0.9           110.               44
    # 5 40001 L                2     2     1             NA                0.9           110.               44
    # 6 40001 L                3     2     1             NA                0.9           110.               44
    

    (I admit that this looks/feels wonky, should need all of that renaming. But now you need to un-rename for the final step.) Perhaps the first solution is cleaner/better :-)

r2evans
  • 141,215
  • 6
  • 77
  • 149