1

I have a large list of dataframes that I would like to subset and reorder, so that they can be horizontally joined. Ideally, I would merge them, however this consistently exceeds my computer's capabiltiies, so I need to join them manually.

My large list is a list of 10 dataframes, all with 10 unique variables and one shared variable, with over 1000 rows for each dataframe:

df1:

ID, V2, V3, etc.

df2:

ID, V2_2, V3_2, etc.

and so forth.

All of these dataframes are contained within the same large list. How can I subset all of these dataframes within that large list simultaneously, such that they only include observations where ID > 100? Similarly, how can I order all of these dataframes simultaneously by the ID column, so that I can then bind them horizontally?

This seems to be the only way to have the dataframes merge, because merging the dataframes by ID directly does not appear to work, since the size of the dataframes (even though IDs are all the same for each dataframe, and each ID is unique) causes my R to freeze.

flâneur
  • 633
  • 2
  • 8
  • 2
    10 x 10 x 1000's shouldn't make R crashing. Can you add the code you tried to use and an example of your data? – harre Aug 01 '22 at 17:49
  • @harre unfortunately, I cannot provide the data, but I have tried various types of merging codes, including ```merge <- Reduce(function(x, y) merge(x, y, by = "ID"), list)``` or manually doing it in sequence like: ```merge <- merge(list[1],list[2], by='ID') %>% merge(., list[3], by='ID') %>% etc. ``` – flâneur Aug 01 '22 at 17:53
  • To get good answers then, you might consider: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – harre Aug 01 '22 at 18:02

2 Answers2

3

Try the following. Untested, since there are no test data.

1. Filter and sort

Filter ID > 100 and sort by ID.

df_list <- lapply(df_list, \(x) {
  x <- x[x$ID > 100, ]
  x[order(x$ID), ]
})

2. Merge the data.frames

This is the complicated part. The code below creates a result data.frame df_all and then puts each data.frame in df_list in its place by matching the id's. This is probably faster and no merge and Reduce function calls are needed.

all_cols <- unique(sapply(df_list, names))
all_rows <- unique(sapply(df_list, `[[`, 'ID'))
all_rows <- sort(all_rows)
df_all <- matrix(nrow = length(all_rows), ncol = length(all_cols))
df_all <- as.data.frame(df_all)
names(df_all) <- all_cols
df_all$ID <- all_rows

for(i in seq_along((df_list))) {
  i <- match(df_list[[i]]$ID, all_rows)
  df_all[i, names(df_list[[i]])] <- df_list[[i]]
}
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
  • Thank you, this first solution works. The second part makes sense, but I receive errors. On the first line, I receive the following ```Error in base::union(x, y, ...) : argument "y" is missing, with no default ``` Any idea of a change to be made? If it could work as well, that would be ideal to avoid the ```merge```/```reduce``` issues – flâneur Aug 01 '22 at 20:00
  • 1
    @flâneur Ok, It should be `unique`, in my tests I had more than one vector. Edited, see now if it works. – Rui Barradas Aug 01 '22 at 21:11
1

This should return the desired list:

lapply(df_list, \(df) df[df$ID > 100, ])

EDIT

Update after comment that you have a list of lists. Try this:

lapply(df_list, \(df) {
    df  <- data.frame(unlist(df, recursive=FALSE))
    df[df$ID > 100, ]
    }
)
SamR
  • 8,826
  • 3
  • 11
  • 33
  • Thank you, I see the rational behind this, however all this returns in the console is the following ``` [[1]] [[2]] [[3]] [[4]] [[5]] [[6]] [[7]] [[8]] [[9]] [[10]] ``` with seemingly no change to any of the dataframes. Any reason why this is happening? – flâneur Aug 01 '22 at 17:58
  • 1
    So it is returning 10 empty dataframes? Hard to say why without a reproducible example of your data using `dput()`. Is your ID column a numeric vector? If you have an ID column and some values are above 100 this should work. – SamR Aug 01 '22 at 18:07
  • Apologies for not being able to provide the example now. It appears to return the exact same dataframe as before, and just reads those numbers in the console. I am wondering if it is an issue with the lapply function I used to create the large list of dataframes. The large list appears to actually be a large list of large lists, which I guess makes it difficult to operate with. Is there a way to prevent the double large listing? – flâneur Aug 01 '22 at 18:14
  • Oh it's a list of lists? I thought it was a list of data frames. How many data frames does each sublist contain? – SamR Aug 01 '22 at 18:19
  • Each sublist is one dataframe. It seems I created lists or "tibbles" when in reality each should be a dataframe. Is there a way to convert each sublist into a dataframe within the large list? Then your solution would likely work. – flâneur Aug 01 '22 at 18:22