Remove rows until columns are identical over multiple data frames

Question

I have 4 data frames named w, x, y, z each with 3 columns and identical column names. I now execute an operation that removes rows until the column named Type is identical over all four data frames.

To achieve this I am using a while loop with the following code:


list_df <- list(z, w, x, y)
tmp <- lapply(list_df, `[[`, 'Type')
i <- as.integer(as.logical(all(sapply(tmp, function(x) all(x == tmp[[1]])))))
                
while (i == 0) {
                  
 z <- z[(z$Type %in% x$Type),]
 y <- y[(y$Type %in% x$Type),]
 w <- w[(w$Type %in% x$Type),]
                      
 z <- z[(z$Type %in% w$Type),]
 y <- y[(y$Type %in% w$Type),]
 x <- x[(x$Type %in% w$Type),]
                     
 z <- z[(z$Type %in% y$Type),]
 x <- x[(x$Type %in% y$Type),]
 w <- w[(w$Type %in% y$Type),]
                      
 x <- x[(x$Type %in% z$Type),]
 w <- w[(w$Type %in% z$Type),]
 y <- y[(y$Type %in% z$Type),]
                     
 list_df <- list(z, w, x, y)
 tmp <- lapply(list_df, `[[`, 'Type')
 i <- as.integer(as.logical(all(sapply(tmp, function(x) all(x == tmp[[1]])))))
 }

In this code, a list is created for the column Type of every data frame. Then the value i tests for identicality and produces 0 if false and 1 if true. The while loop then performs the deletion of rows not included in every data frame and only stops until i becomes 1.

This code works, but applying it to bigger data can result in a long time for the code to go through. Does anybody have an idea on how to simplify this execution?

For reproducible example:

w <- structure(list(Type = c("26809D", "28503C", "360254", "69298N", 
"32708V", "680681", "329909", "696978", "32993F", "867609", "51206K", 
"130747"), X1980 = c(NA, NA, NA, 271835, NA, NA, NA, NA, NA, 
NA, NA, NA), X1981 = c(NA, NA, NA, 290314, NA, NA, NA, NA, NA, 
NA, NA, NA)), row.names = c("2", "4", "7", "8", "10", "11", "13", 
"16", "17", "21", "22", "23"), class = "data.frame")

x <- structure(list(Type = c("26809D", "28503C", "360254", "69298N", 
"32708V", "680681", "329909"), X1980 = c(NA, NA, NA, 1026815, 
NA, NA, NA), X1981 = c(NA, NA, NA, 826849, NA, NA, NA)), row.names = c("2", 
"4", "7", "8", "10", "11", "13"), class = "data.frame")

y <- structure(list(Type = c("26809D", "28503C", "360254", "69298N", 
"32708V"), X1980 = c(NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_), X1981 = c(NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_)), row.names = c("2", "4", "7", "8", "10"), class = "data.frame")

z <- structure(list(Type = c("26809D", "28503C", "360254", "69298N", 
"32708V", "680681", "329909", "696978", "32993F", "867609", "51206K", 
"130747", "50610H"), X1980 = c(NA, NA, NA, 0.264736101439889, 
NA, NA, NA, NA, NA, NA, NA, NA, NA), X1981 = c(NA, NA, NA, 0.351108848169376, 
NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c("2", "4", 
"7", "8", "10", "11", "13", "16", "17", "21", "22", "23", "24"
), class = "data.frame")

I'm a little confused. Is the data you are showing at the end of the post your output? Those dataframes all share identical entries in the Type column, right? What is the input data? — Mario Niepel, Jan 03 '21 at 14:14
Sorry about that! I accidentally posted the output. I've edited the question to show the input. — Connor Uhl, Jan 03 '21 at 14:31

G. Grothendieck · Accepted Answer · 2021-01-03T15:03:43.777

We assume that the question is how to get the values of Type that are common to 4 data frames each of which has a Type column containing unique values.

Form a list L of the data frames, extract the Type column using lapply and [ and iterate merge over that using Reduce :

L <- list(w, x, y, z)
L.Type <- lapply(L, "[", TRUE, "Type", drop = FALSE) # list of DFs w only Type col
Reduce(merge, L.Type)$Type
## [1] "26809D" "28503C" "32708V" "360254" "69298N"

or replace last line with this giving the same result except for order:

Reduce(intersect, L.Type)$Type
## [1] "26809D" "28503C" "360254" "69298N" "32708V"

Another approach which is a bit tedious but does reduce the calulation to one line is to manually iterate intersect:

intersect(w$Type, intersect(x$Type, intersect(y$Type, z$Type)))
## [1] "26809D" "28503C" "360254" "69298N" "32708V"

Another example

The example data is not very good to illustrate this because every data frame has the same values of Type so let us create another example. BOD is a built-in data frame has 6 rows. We assign it to X and rename the columns so that the first one has the name Type. Then for i equals 1, 2, 3, 4 we remove the i-th row giving 4 data frames with 5 rows each and 2 values of Type common to all 4. The result correctly shows that 5 and 7 are the only common Type values.

# set up input L, a list of 4 data frames
X <- BOD
names(X) <- c("Type", "X")
L <- lapply(1:4, function(i) X[-i, ])

L.Type <- lapply(L, "[", TRUE, "Type", drop = FALSE)
Reduce(merge, L.Type)$Type
## [1] 5 7

Thanks for providing this extensive solution. I have to admit, I am a little confused by the `L.Type <- lapply(L, "[", TRUE, "Type", drop = FALSE)` command. If you feel inclined to do so, would you mind explaining why it works? — Mario Niepel, Jan 03 '21 at 14:59
It is the same as `lapply(L, function(x) x[TRUE, "Type", drop = FALSE])` which is the same as `lapply(L, function(x) x[, "Type", drop = FALSE])` since `lapply` passes an object to the first argument of the function `[`and then uses the remaining specified arguments to pass to it. — G. Grothendieck, Jan 03 '21 at 15:01
That's cool. So `[` is a function in R. That makes sense. And now that I know there are other examples here on SO that discuss this [e.g. here](https://stackoverflow.com/questions/19260951/using-square-bracket-as-a-function-for-lapply-in-r). And the `TRUE` is the placeholder argument that makes sure all columns are selected by leaving the second argument of the function ` "["(df, rows, cols)` empty? — Mario Niepel, Jan 03 '21 at 15:11

Remove rows until columns are identical over multiple data frames

1 Answers1

Another example