I have a multilevel dataset df with the following organization:
ID Eye Video_number Time Day measurement1
40001 L 1 1 1 0.60
40001 L 2 1 1 0.50
40001 L 3 1 1 0.80
40001 L 1 2 1 0.60
40001 L 2 2 1 0.60
40001 L 3 2 1 0.60
Goal: I am trying to replace measurement values that have a coefficient of variation above 45 with NA, since these values are probably less precise and should be excluded.
The coefficient of variation (sometimes denoted CV) of a distribution is defined as the ratio of the standard deviation $\sigma$ to the mean $\mu$, with $\mu$ and $\sigma$ obtained from the raw data.
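Written out, with the factor of 100 that the cv() function in this post applies so that the threshold of 45 reads as a percentage:

```latex
\mathrm{CV} = \frac{\sigma}{\mu} \times 100\%
```

For example, a group with mean 0.5667 and standard deviation 0.0577 has a CV of about 10.19%.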
- I obtained the CV values per Time unit (pooling the measurements of the three videos in each Time unit) with the following function and for loop. I got help from the following threads:
How to correctly use group_by() and summarise() in a For loop in R
Append data frames together in a for loop
library(dplyr)

# Define function: coefficient of variation as a percentage, ignoring NAs
cv <- function(x) {
  sd(na.omit(x)) / mean(na.omit(x)) * 100
}

# Variables
vars <- c("measurement1", "measurement2", "measurement3")

# Create a table with all CV values by ID, Eye, Day, and Time
df_cv <- data.frame()
for (i in vars) {
  df <- df.m2
  df$values <- df[[i]]  # copy the current variable into a common column
  x <- df %>%
    group_by(ID, Eye, Day, Time) %>%
    summarise(Count = n(),
              Mean = mean(values, na.rm = TRUE),
              SD = sd(values, na.rm = TRUE),
              CV = cv(values)) %>%
    mutate(Variable = paste(i, "cv", sep = "_"))
  df_cv <- rbind(df_cv, x)
}
df_cv$CV[is.nan(df_cv$CV)] <- 0  # for 0/0 on CV formula giving NaN
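As a side note, the per-variable loop and rbind() above could probably be collapsed into a single grouped summary. The following is only a sketch, assuming dplyr 1.0 or later (for across()); the data frame toy and the vector toy_vars are made-up stand-ins for df.m2 and vars, and the values are illustrative.

```r
library(dplyr)

# Toy data standing in for df.m2 (hypothetical, illustrative values only)
toy <- data.frame(
  ID = 40001, Eye = "L", Day = 1,
  Time = rep(1:2, each = 3),
  Video_number = rep(1:3, times = 2),
  measurement1 = c(0.6, 0.5, 0.6, 0.8, 0.3, 0.0),
  measurement2 = c(1.0, 1.1, 0.9, 2.0, 2.0, 2.0)
)
toy_vars <- c("measurement1", "measurement2")

# Coefficient of variation as a percentage, ignoring NAs
cv <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / mean(x) * 100
}

# One row per ID/Eye/Day/Time group, one <variable>_cv column per variable
cv_wide <- toy %>%
  group_by(ID, Eye, Day, Time) %>%
  summarise(across(all_of(toy_vars), cv, .names = "{.col}_cv"),
            .groups = "drop")
```

Because the result is already in wide format with columns named measurement1_cv, measurement2_cv, and so on, it could be merged with the original data on ID, Eye, Day, and Time without a separate reshaping step.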
- It resulted in the following table, df_cv:
ID Eye Day Time Count Mean SD CV Variable
40001 L 1 1 3 0.56666667 0.057735027 10.1885342 measurement1_cv
40001 L 1 2 3 0.36666667 0.404145188 110.2214150 measurement1_cv
40001 L 1 3 3 0.50000000 0.000000000 0.0000000 measurement1_cv
- I reformatted df_cv above to wide format (variables and CVs across a row rather than down a column). This enabled me to merge the CVs with the original df:
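# Wide format: one *_cv column per variable (dcast() is provided by reshape2 or data.table)
df_cv <- dcast(df_cv, ID + Eye + Day + Time ~ Variable, value.var = "CV")
df <- merge(df, df_cv, by = c("ID", "Eye", "Day", "Time"))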
ID Eye Video_number Time Day measurement1 measurement1_cv
40001 L 1 1 1 0.60 10.1885342
40001 L 2 1 1 0.50 10.1885342
40001 L 3 1 1 0.80 10.1885342
40001 L 1 2 1 0.80 110.2214150
40001 L 2 2 1 0.30 110.2214150
40001 L 3 2 1 0.00 110.2214150
- I now want to put NAs into the cells of measurement1 that have a CV > 45. I know how to do this measurement by measurement, but I was wondering whether a for loop could do it, since I am analyzing a lot of variables:
df$measurement1[df$measurement1_cv>45]<-NA
df$measurement2[df$measurement2_cv>45]<-NA
df$measurement3[df$measurement3_cv>45]<-NA
Below are my failed attempts:
for (i in vars) {
df<-df.m3
df$i[df$i_cv>45]<-NA
}
Error in `$<-.data.frame`(`*tmp*`, "i", value = logical(0)) :
replacement has 0 rows, data has 609
for (i in vars) {
df<-df.m3
df$i[df$paste(i,"_cv")>45]<-NA
}
Error in df$paste(i, "_cv") : attempt to apply non-function
Any help is greatly appreciated!
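For reference, here is what seems to go wrong in the attempts above, followed by one possible loop. df$i looks for a column literally named "i" rather than using the value of i, and df$paste(...) is parsed as a call to a function stored in df under the name paste, which does not exist. In addition, paste(i, "_cv") inserts a space by default (giving "measurement1 _cv"), and df<-df.m3 inside the loop resets df on every iteration. The sketch below uses [[ ]] with paste0() instead; toy and toy_vars are hypothetical stand-ins for the merged df and vars, with illustrative values.

```r
# Toy data standing in for the merged df (hypothetical, illustrative values)
toy <- data.frame(
  measurement1    = c(0.6, 0.5, 0.6, 0.8, 0.3, 0.0),
  measurement1_cv = c(10.2, 10.2, 10.2, 110.2, 110.2, 110.2),
  measurement2    = c(1.0, 1.1, 0.9, 2.0, 2.0, 2.0),
  measurement2_cv = c(10, 10, 10, NA, NA, NA)
)
toy_vars <- c("measurement1", "measurement2")

for (i in toy_vars) {
  cv_col <- paste0(i, "_cv")               # e.g. "measurement1_cv", no space
  too_noisy <- which(toy[[cv_col]] > 45)   # which() skips rows where the CV is NA
  toy[[i]][too_noisy] <- NA                # [[ ]] uses the value of i as the column name
}
```

Wrapping the comparison in which() is a design choice: rows whose CV is NA are left untouched instead of being turned into NA as well.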