Creating a loop in R to clean multiple datasets

Question

Suppose I have the following dataframes (essentially 4 similar datasets):

q1_hosp <- data.frame(dxe1 = c(1,NULL, NA, NULL, 1), dxe2 = c(1,NULL, NULL, NULL, 1))
q2_hosp <- data.frame(dxe1 = c(NULL,1, 1, NA, 1), dxe2 = c(1,1, 1, NULL, 1))
q3_hosp <- data.frame(dxe1 = c(NA,NULL, 1 1, 1), dxe2 = c(1,NA, NA, NULL, 1))
q4_hosp <- data.frame(dxe1 = c(1,1, 1 NULL, 1), dxe2 = c(1,NULL, 1, 1, 1))

What I would like is to turn the NULL entries into missing values in R. I can do them by writing the following codes in a tedious manner:

any(is.na(q1_hosp$dxe1)) # there are missing rows, keep NAs
q1_hosp$dxe1 <- na_if(q1_hosp$dxe1, "NULL") #convert NULLto NA
any(is.na(q1_hosp$dxe2)) # there are missing rows, keep NAs
q1_hosp$dxe2 <- na_if(q1_hosp$dxe2, "NULL") #convert NULLto NA

any(is.na(q2_hosp$dxe1)) # there are missing rows, keep NAs
q2_hosp$dxe1 <- na_if(q2_hosp$dxe1, "NULL") #convert NULLto NA
any(is.na(q2_hosp$dxe2)) # there are missing rows, keep NAs
q2_hosp$dxe2 <- na_if(q2_hosp$dxe2, "NULL") #convert NULLto NA

any(is.na(q3_hosp$dxe1)) # there are missing rows, keep NAs
q3_hosp$dxe1 <- na_if(q3_hosp$dxe1, "NULL") #convert NULLto NA
any(is.na(q3_hosp$dxe2)) # there are missing rows, keep NAs
q3_hosp$dxe2 <- na_if(q3_hosp$dxe2, "NULL") #convert NULLto NA

any(is.na(q4_hosp$dxe1)) # there are missing rows, keep NAs
q4_hosp$dxe1 <- na_if(q4_hosp$dxe1, "NULL") #convert NULLto NA
any(is.na(q4_hosp$dxe2)) # there are missing rows, keep NAs
q4_hosp$dxe2 <- na_if(q4_hosp$dxe2, "NULL") #convert NULLto NA

What I would like is to create a loop so that the process is less tedious.

I have been trying with this

  for(i in 1:4) {
  ##Clean the dxe variables
  any(is.na(q[i]_hosp$dxe1)) # there are missing rows, keep NAs
  q[i]_hosp$dxe1 <- na_if(q[i]_hosp$dxe1, "NULL") #convert NULLto NA
 any(is.na(q[i]_hosp$dxe2)) # there are missing rows, keep NAs
  q[i]_hosp$dxe1 <- na_if(q[i]_hosp$dxe2, "NULL") #convert NULLto NA
}

But my code doesn't work and I am getting errors. Can anyone help write the for loop for what I want to achieve?

Use lists of frames, https://stackoverflow.com/a/24376207/3358227. `q[i]_host` will not work, that is looking for a vector or list `q` to find the `i`th element, and they will get a parsing error because of `_host`. I suggest you put your frames into a list-of-frames, as it will make many operations done on one frame easy to do on all frames (at once). — r2evans, Apr 30 '21 at 00:42
This is more efficient https://stackoverflow.com/questions/30810681/r-data-table-replace-null-with-na-when-columns-are-factors — Peace Wang, Apr 30 '21 at 00:49

score 2 · Accepted Answer · answered Apr 30 '21 at 01:25

2

Get the data in a list and use lapply to change 'NULL' to NA from each of the dataframe.

result <- lapply(mget(sprintf('q%d_hosp', 1:4)), function(x) {x[x == 'NULL'] <- NA;x})

If you want the changes in individual dataframes back.

list2env(result, .GlobalEnv)

answered Apr 30 '21 at 01:25

Ronak Shah

377,200
20
156
213

1

Thanks. `list2evn` is useful. Learn a lot from you. – Peace Wang Apr 30 '21 at 02:09
@Ronak Shah So for the actual dataset, I am having trouble getting the NAs if I use your code above. Suppose I have a variable dxe1 that is common in all datasets and I want to convert all the NULLs to NAs in the four datasets q1_hosp, q2_hosp, q3_hosp, q4_hosp. What is the role of 'x' in your code? I got the data in a list using q1_hosp <- list(q1_hosp). Named the list and the dataframe the same. – OGC May 10 '21 at 23:58
Btw dxe1 is a character variable – OGC May 11 '21 at 00:36
`x` is the individual dataframe in each iteration. So if you only want to change `dx1` column in your dataframes. `cols <- c('dx1')` and `result <- lapply(mget(sprintf('q%d_hosp', 1:4)), function(x) {x[cols][x[cols] == 'NULL'] <- NA;x})` should work. – Ronak Shah May 11 '21 at 01:49

score 1 · Answer 2 · answered Apr 30 '21 at 01:12

Try this.

all names of dataset are stored in dl, which can access with get function
RMNULL function is used to replace NULL by NA
lapply function is used to loop all datasets

# dataset name list 
dl <- sprintf("%s%s%s","q",1:2,"_hosp")
df.list <- lapply(dl,get)

# replace NULL with NA
## remove NULL function
RMNULL <- function(df){
    is.na(df) <- df == "NULL"
    return(df)
}
lapply(df.list, RMNULL)

example data:

q1_hosp <- structure(list(x = 1:3, y = list(NULL, NULL, NULL)), .Names = c("x",  "y"), row.names = c(NA, -3L), class = "data.frame")

q2_hosp <- structure(list(x = 4:6, y = list(1, NULL, NULL)), .Names = c("x",  "y"), row.names = c(NA, -3L), class = "data.frame")

Creating a loop in R to clean multiple datasets

2 Answers2