How to identify the missing rows of a specific column and do analysis on their available values in other columns

Question

I have data like this:

ID <- c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3)
X1<-c(1.1,0.2,0.4,0.8,1.3,2.3,1.1,3.2,NA,0.8,2.1,NA,1.1,0.2,0.4,0.8,NA,0.6)
X2<-c(0.8,NA,1.2,0.3,NA,NA,0.8,NA,1.5,2.7,2.2,NA,0.8,3.1,1.7,0.3,1.1,2.4)
X3<-c(0.1,0.3,1.1,2.2,0,NA,0.1,3.3,1.4,2.3,0,NA,NA,0.3,2.8,2.3,0,NA)
Time<-c("baseline","week1","week2","week3","week4","week5","baseline","week1","week2","week3","week4","week5","baseline","week1","week2","week3","week4","week5")
data<-data.frame(ID,X1,X2,X3,Time)

What I want to do is to:

Count the missing values in each of X1, X2 and X3, then compute descriptive statistics (mean ± SD) at Time = baseline for the IDs that have missing values. (For instance, in X3, ID = 1 has a missing value at week5, so this ID should be identified; its baseline value, which is not missing, could then be used in the descriptive statistics.)
Find the time point (Time = ?) at which X2 and X3 first have missing values.
Find the IDs that have missing values in each of X1, X2 and X3.

Does anyone know of any code that can do that?

Have you tried to use any functions or to write code that implements your goal? If so, what outputs or errors did you get? Your `baseline` object is missing from the example data that you provide. It's also not clear what the `week_` objects are supposed to be. — John Polo, Nov 13 '22 at 22:04
@JohnPolo Unfortunately, I'm not a trained statistician and don't have much experience in R, John. I've been trying to find them more or less manually, but that doesn't look feasible! I did manage to find the NAs using the describeBy command, but that doesn't help me run an analysis on the rows with missing values using their non-missing values in the other variables of interest at Time = baseline — Aura, Nov 13 '22 at 22:08
The solution would require prior knowledge of the variables, sources, sorting... to provide a basis for interpolation. — , Nov 13 '22 at 22:12
While statistical knowledge is important, several of your objectives in this question only rely on code: "find the number of miss values...", "find out from which time point...". Stats has nothing to do with those tasks. Since you asked on this forum, people here are focused on code, not stats. If you have purely statistical questions, you should ask on stats.stackexchange.com. — John Polo, Nov 13 '22 at 22:15
@Strom Thank you, but I'm not sure what sort of details I should provide here. This is a set of longitudinal data containing a series of outcome variables, a series of indicator variables, ID, and Time. The last time point has a lot of missingness in both the outcomes and the indicators. So finding the number of missing values, and describing at baseline the patients who had missing data at the last week, is crucial. Clinically, it is also important to know from which time point these patients started to have missing values for the outcomes and indicators. — Aura, Nov 13 '22 at 22:18
@JohnPolo Thanks John, yes, as I mentioned in the last line of my question, I am looking for code that can do this for me. I also believe these questions are not statistical, except for the descriptive statistics. But thank you for the heads-up. I think this should be the right place to ask this question :) — Aura, Nov 13 '22 at 22:21
You can provide your data by using `dput(nameofyourdata)`. If that creates a very large output, you can use `dput(head(nameofyourdata))` instead. You can use the "edit" function for your question and replace the data that you originally supplied with that. Also consider reading this: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — John Polo, Nov 13 '22 at 22:21
@JohnPolo Thanks John, but does this mean sharing the original data? Because that's not possible given the data safety and security terms :) — Aura, Nov 13 '22 at 22:24
Aura, in general, it is expected that people who ask for help with code have made an attempt at the code. Show some effort. — John Polo, Nov 13 '22 at 22:24
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/249573/discussion-between-john-polo-and-aura). — John Polo, Nov 13 '22 at 22:26

score 0 · Answer 1 · answered Nov 14 '22 at 02:38

Along with the other problems in this question, you asked for help with three different objectives. In other words, you asked three questions in one. That's also frowned upon.

This code addresses your first objective:

library(tidyverse)

ID <- c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3)
X1<-c(1.1,0.2,0.4,0.8,1.3,2.3,1.1,3.2,NA,0.8,2.1,NA,1.1,0.2,0.4,0.8,NA,0.6)
X2<-c(0.8,NA,1.2,0.3,NA,NA,0.8,NA,1.5,2.7,2.2,NA,0.8,3.1,1.7,0.3,1.1,2.4)
X3<-c(0.1,0.3,1.1,2.2,0,NA,0.1,3.3,1.4,2.3,0,NA,NA,0.3,2.8,2.3,0,NA)
Time<-c("baseline","week1","week2","week3","week4","week5","baseline","week1","week2","week3","week4","week5","baseline","week1","week2","week3","week4","week5")
data<-data.frame(ID,X1,X2,X3,Time)

data %>% pivot_longer(cols=c(X1,X2,X3), names_to="Xtypes") %>% 
     group_by(ID, Time) %>% 
     summarize(sumNA=sum(is.na(value)), meanNA=mean(is.na(value)), sdNA=sd(is.na(value)))

# That returns the following:
`summarise()` has grouped output by 'ID'. You can override using the `.groups` argument.
# A tibble: 18 × 5
# Groups:   ID [3]
      ID Time     sumNA meanNA  sdNA
   <dbl> <chr>    <int>  <dbl> <dbl>
 1     1 baseline     0  0     0    
 2     1 week1        1  0.333 0.577
 3     1 week2        0  0     0    
 4     1 week3        0  0     0    
 5     1 week4        1  0.333 0.577
 6     1 week5        2  0.667 0.577
 7     2 baseline     0  0     0    
 8     2 week1        1  0.333 0.577
 9     2 week2        1  0.333 0.577
10     2 week3        0  0     0    
11     2 week4        0  0     0    
12     2 week5        3  1     0    
13     3 baseline     1  0.333 0.577
14     3 week1        0  0     0    
15     3 week2        0  0     0    
16     3 week3        0  0     0    
17     3 week4        1  0.333 0.577
18     3 week5        1  0.333 0.577

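pivot_longer reshapes the data frame from wide to long, so that the X1, X2 and X3 measurements end up in a single value column, with the variable name in Xtypes. group_by defines the groups (here, each combination of ID and Time), and summarize is the verb that computes the function(s) therein once per group. You asked for a sum ("number of"), mean, and sd. Note that meanNA and sdNA here are the mean and standard deviation of the is.na() indicator (that is, the proportion of missing values in each group and its spread), not of the measurements themselves.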
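If the count is wanted per variable ("for each of X1, X2, X3") rather than per ID and time point, the same long-format data can be grouped by the names column instead. This is only a minimal sketch: the column names `n_missing` and `n_ids_missing` are my own choices, not something from the question.

```r
library(dplyr)
library(tidyr)

# Example data from the question
ID <- c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3)
X1 <- c(1.1,0.2,0.4,0.8,1.3,2.3,1.1,3.2,NA,0.8,2.1,NA,1.1,0.2,0.4,0.8,NA,0.6)
X2 <- c(0.8,NA,1.2,0.3,NA,NA,0.8,NA,1.5,2.7,2.2,NA,0.8,3.1,1.7,0.3,1.1,2.4)
X3 <- c(0.1,0.3,1.1,2.2,0,NA,0.1,3.3,1.4,2.3,0,NA,NA,0.3,2.8,2.3,0,NA)
Time <- rep(c("baseline","week1","week2","week3","week4","week5"), times = 3)
data <- data.frame(ID, X1, X2, X3, Time)

# One row per ID, Time and variable; measurements land in `value`
long <- pivot_longer(data, cols = c(X1, X2, X3), names_to = "Xtypes")

# Per variable: number of missing cells and number of distinct IDs affected
# (n_missing and n_ids_missing are illustrative names)
na_per_var <- long %>%
  group_by(Xtypes) %>%
  summarize(n_missing = sum(is.na(value)),
            n_ids_missing = n_distinct(ID[is.na(value)]))

na_per_var
```

Grouping by Xtypes alone collapses over ID and Time, so each variable gets a single row.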

You also wrote "... but when Time=baseline". I don't know what you mean by that. Were you looking only for when literally Time=="baseline"? If that's the case, you want this instead:

data %>% pivot_longer(cols=c(X1,X2,X3), names_to="Xtypes") %>% 
     group_by(ID) %>% 
     filter(Time=="baseline") %>% 
     summarize(sumNA=sum(is.na(value)), meanNA=mean(is.na(value)), sdNA=sd(is.na(value)))

If you meant NOT when Time=baseline, change the == in filter to !=.
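Another possible reading of the question is: for each variable, find the IDs that have a missing value at any time point, then summarize the baseline values of that same variable for those IDs. The sketch below assumes that reading, and also touches the other two objectives. The object names (`ids_missing`, `baseline_stats`, `first_missing`, `time_levels`) are illustrative, and the time points are assumed to be ordered baseline, week1, ..., week5.

```r
library(dplyr)
library(tidyr)

# Example data from the question
ID <- c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3)
X1 <- c(1.1,0.2,0.4,0.8,1.3,2.3,1.1,3.2,NA,0.8,2.1,NA,1.1,0.2,0.4,0.8,NA,0.6)
X2 <- c(0.8,NA,1.2,0.3,NA,NA,0.8,NA,1.5,2.7,2.2,NA,0.8,3.1,1.7,0.3,1.1,2.4)
X3 <- c(0.1,0.3,1.1,2.2,0,NA,0.1,3.3,1.4,2.3,0,NA,NA,0.3,2.8,2.3,0,NA)
Time <- rep(c("baseline","week1","week2","week3","week4","week5"), times = 3)
data <- data.frame(ID, X1, X2, X3, Time)

# Assumed chronological order of the time points
time_levels <- c("baseline","week1","week2","week3","week4","week5")

long <- pivot_longer(data, cols = c(X1, X2, X3), names_to = "Xtypes")

# IDs with at least one NA, listed per variable
ids_missing <- long %>%
  filter(is.na(value)) %>%
  distinct(Xtypes, ID)

# Baseline mean and SD per variable, restricted to those IDs.
# A baseline value can itself be NA, hence na.rm = TRUE.
baseline_stats <- long %>%
  filter(Time == "baseline") %>%
  semi_join(ids_missing, by = c("Xtypes", "ID")) %>%
  group_by(Xtypes) %>%
  summarize(n_ids = n(),
            mean_baseline = mean(value, na.rm = TRUE),
            sd_baseline = sd(value, na.rm = TRUE))

# Earliest time point at which each variable has an NA
first_missing <- long %>%
  filter(is.na(value)) %>%
  group_by(Xtypes) %>%
  summarize(first_missing = time_levels[min(match(Time, time_levels))])

ids_missing
baseline_stats
first_missing
```

semi_join keeps only the baseline rows whose (Xtypes, ID) pair appears in ids_missing, so each variable is summarized over its own set of affected IDs. If you instead want the baseline values of other columns for those IDs, the join key would need to be ID alone.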
