Find specific value in nested data frames and get position and/or value

Question

I have a nested list sampleList that can contain a variable number of data frames. In this example there are 3 data frames:

df1 <- data.frame(id = as.integer(c(1, 6)), key = c('apple', 'apple.green'), stringsAsFactors=FALSE)
df2 <- data.frame(id = as.integer(c(1, 3, 5)), key = c('apple', 'apple.red', 'apple.red.rotten'), stringsAsFactors = FALSE)
df3 <- data.frame(id = as.integer(c(17)), key = c('orange'), stringsAsFactors = FALSE)
sampleList <- list(df1, df2, df3)

I want to search for specific integers e.g. 6 in the id column across all data frames contained in the sampleList. As a result, I need the position and if possible the associated value from the key column.

The closest I got was the position in a specific data frame e.g. 1.

which(sampleList[[1]] == 6)
[1] 2

Since the number of data frames can be different each time, I need a more dynamic query.

Thanks a lot for your help.

What should happen to list elements where no `id` of 6 was found? — Donald Seinen, Jan 26 '22 at 16:17
They should not appear in the result. If no 6 is found in all data frames, a 0 as a result would be great. — i-box, Jan 26 '22 at 16:23

ctde · Answer 1 · 2022-01-26T17:20:05.273

If you work with nested data, I recommend watching "Hadley Wickham: Managing many models with R" on YouTube; you'll be impressed with how useful the approach is. After that, look at the example by Laurens Geffert (search for "Nesting Birds and Models in R Dataframes").

I recommend using tibbles for nicer output, but since the question uses data.frames, I have commented out the coercion to tibbles.

Explanation 1: using dplyr and the pipe, we take each object (data.frame) from the list and apply the same filter we would apply to a single data frame. The tilde (~) is purrr's shorthand for an anonymous function: map() applies the expression that follows it to every element of the list, with . standing for the current element. This approach is more practical if your goal is to operate on the data.frames while keeping them as separate objects.

library(tidyr)
library(dplyr)
library(purrr)

df1 <- data.frame(id = as.integer(c(1, 6)), key = c('apple', 'apple.green'), stringsAsFactors=FALSE)
df2 <- data.frame(id = as.integer(c(1, 3, 5)), key = c('apple', 'apple.red', 'apple.red.rotten'), stringsAsFactors = FALSE)
df3 <- data.frame(id = as.integer(c(17)), key = c('orange'), stringsAsFactors = FALSE)

lt = lst(df1,# %>% as_tibble(.),
         df2,# %>% as_tibble(.),
         df3 #%>% as_tibble(.)
         )

lt %>% map(~filter(.,id==6))


# $df1
# id         key
# 1  6 apple.green
# 
# $df2
# [1] id  key
# <0 rows> (or 0-length row.names)
# 
# $df3
# [1] id  key
# <0 rows> (or 0-length row.names)

The next example answers your question about getting the positions and values out.

Explanation 2: using lapply, we can get the respective positions in each data.frame or the values of column key, but I suspect you are looking to manipulate multiple data.frames simultaneously. If not, and you're just trying to find locations per data.frame (i.e., getting your hands dirty), then just grab positions with the classic base R logic per data.frame using lapply.

# row positions, per list element, where id == 6
lapply(lt,function(x)which(x$id==6))

# values of column key, per list element, where id == 6
lapply(lt,function(x)x$key[which(x$id==6)])
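The two lapply calls above still return an entry for every data frame, including those without a match. A minimal sketch of one way to drop the non-matching elements and return 0 when nothing is found anywhere, as requested in the comments on the question. The helper name find_id is hypothetical and not part of any package.

```r
# Sample data from the question
df1 <- data.frame(id = as.integer(c(1, 6)), key = c('apple', 'apple.green'), stringsAsFactors = FALSE)
df2 <- data.frame(id = as.integer(c(1, 3, 5)), key = c('apple', 'apple.red', 'apple.red.rotten'), stringsAsFactors = FALSE)
df3 <- data.frame(id = as.integer(c(17)), key = c('orange'), stringsAsFactors = FALSE)
lt <- list(df1 = df1, df2 = df2, df3 = df3)

# find_id() is a hypothetical helper written for this illustration
find_id <- function(dfs, target) {
  pos <- lapply(dfs, function(x) which(x$id == target))  # positions per data frame
  pos <- pos[lengths(pos) > 0]                           # keep only data frames with a match
  if (length(pos) == 0) return(0)                        # no match in any data frame
  pos
}

find_id(lt, 6)   # a list with one element, named df1
find_id(lt, 99)  # 0
```

Because the list is named, the names of the surviving elements tell you which data frame each position belongs to.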

Donald Seinen · Accepted Answer · 2022-01-28T05:03:55.000


I have slightly altered the data, adding 6 to df3.

df1 <- data.frame(id = as.integer(c(1, 6)), key = c('apple', 'apple.green'), stringsAsFactors=FALSE)
df2 <- data.frame(id = as.integer(c(1, 3, 5)), key = c('apple', 'apple.red', 'apple.red.rotten'), stringsAsFactors = FALSE)
df3 <- data.frame(id = as.integer(c(6, 17)), key = c('orange', 'blue'), stringsAsFactors = FALSE)
sampleList <- list(df1, df2, df3)

tidyverse

library(tidyverse)
imap_dfr(sampleList,
         ~ mutate(.x, pos = 1:n(), dfr = .y) %>%
           filter(id == 6)) %>%
  when(!!nrow(.) ~., ~0)


#>  id         key pos dfr
#> 1  6 apple.green   2   1
#> 2  6      orange   1   3

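Explanation: using purrr's imap we can access the list index inside the lambda function through .y, while .x is the list element itself. The _dfr suffix row-binds the per-element results into a single data frame. when or {if(!nrow(.)) 0} can be used to conditionally return 0 if no values were found; note that when() is deprecated as of purrr 1.0.0, so the if form is the safer choice in current versions. The . is the placeholder dot in the magrittr pipe.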
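For readers without the tidyverse, the same idea (tag each row with its position and the index of its data frame, filter, then bind) can be sketched in base R. This is an illustrative equivalent under the same sample data, not the code from the answer; the variable names hits and result are chosen here.

```r
df1 <- data.frame(id = as.integer(c(1, 6)), key = c('apple', 'apple.green'), stringsAsFactors = FALSE)
df2 <- data.frame(id = as.integer(c(1, 3, 5)), key = c('apple', 'apple.red', 'apple.red.rotten'), stringsAsFactors = FALSE)
df3 <- data.frame(id = as.integer(c(6, 17)), key = c('orange', 'blue'), stringsAsFactors = FALSE)
sampleList <- list(df1, df2, df3)

# Loop over the indices so that the list position is available inside the function
hits <- lapply(seq_along(sampleList), function(i) {
  x <- sampleList[[i]]
  x$pos <- seq_len(nrow(x))        # row position within its own data frame
  x$dfr <- rep(i, nrow(x))         # index of the data frame in the list
  x[x$id == 6, , drop = FALSE]     # keep only the matching rows
})

result <- do.call(rbind, hits)     # bind the per-element results row-wise
if (nrow(result) == 0) result <- 0 # return 0 when nothing matched
result
```

This sketch assumes every element has an id column; zero-column data frames need the extra check discussed further down.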

base R

Filter(nrow, 
       lapply(sampleList, subset, id == 6)
)

#> [[1]]
#>   id         key
#> 2  6 apple.green
#>
#> [[2]]
#>   id    key
#> 1  6 orange

Explanation: we first subset each list element by the criterion, then use Filter to drop the elements whose nrow is 0. Filter coerces the result of nrow to logical, so 0 becomes FALSE and any positive row count becomes TRUE.
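A small sketch of that coercion, using two made-up data frames (empty and full are names chosen here for illustration):

```r
# Row counts are coerced to logical: 0 is FALSE, anything else is TRUE
as.logical(c(0, 1, 3))

empty <- data.frame(id = integer(), key = character(), stringsAsFactors = FALSE)
full  <- data.frame(id = 6L, key = "orange", stringsAsFactors = FALSE)

# Filter keeps only the elements for which nrow() is non-zero
kept <- Filter(nrow, list(empty, full))
length(kept)
```

Only the data frame that has at least one row survives the Filter call.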

Update: not all empty data.frames are equally empty.

Filter(nrow,
       lapply(sampleList, function(x){
         if(!!length(x)) subset(x, id == 6) else data.frame()
       })
)
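To illustrate why the check is needed, the sketch below contrasts a data frame with zero columns and one with zero rows but an id column. The exact error message for the zero-column case depends on what the name id resolves to in the session, so the sketch only records that an error occurs. Testing for the id column by name, as done at the end, is a variation on the length check above and not the answer's own code.

```r
no_cols <- data.frame()                                    # 0 columns, 0 rows
no_rows <- data.frame(id = integer(), key = character(),
                      stringsAsFactors = FALSE)            # 2 columns, 0 rows
full    <- data.frame(id = 6L, key = "orange", stringsAsFactors = FALSE)

length(no_cols)  # number of columns: 0
length(no_rows)  # number of columns: 2

# subset() cannot find an id column in no_cols, so the comparison fails
res <- tryCatch(subset(no_cols, id == 6), error = function(e) e)
inherits(res, "error")

# Variation (assumption of this sketch): check for the column by name
safe <- Filter(nrow, lapply(list(no_cols, no_rows, full), function(x) {
  if ("id" %in% names(x)) subset(x, id == 6) else data.frame()
}))
length(safe)
```

Checking for the column by name also covers non-empty data frames that happen to lack an id column.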

To extract the row names while retaining the information about where matches were found:

Filter(nrow, 
       lapply(sampleList, subset, id == 6) |>
         setNames(1:length(sampleList)) # swap to appropriate naming policy
) |>
  lapply(\(x) as.integer(rownames(x)))

edited Jan 28 '22 at 05:03

answered Jan 26 '22 at 16:31

Donald Seinen


Thank you very much for your help! Sometimes a "data frame with 0 columns and 0 rows" is also nested in the sampleList. `[[1]] id key 1 17 apple orange [[2]] data frame with 0 columns and 0 rows` Do you know a solution how to avoid this error? – i-box Jan 27 '22 at 15:53
Very nice/concise answer. I like your answer more than mine because it grabs the key and index at the same time. Funnily enough, I have this problem in my work right now and this solution will speed it up. – ctde Jan 27 '22 at 15:59
When running the script with a nested `data frame with 0 columns and 0 rows` an error occurs: `comparison (1) is possible only for atomic and list types` Unfortunately I have no control over the `sampleList`. So the script has to handle the empty data frame. Of course it can be on any position. – i-box Jan 27 '22 at 16:14
@i-box if the input has 0 rows (such edge cases should be in the question) you could insert a `nrow > 0` check before running the `mutate` sequence. The cases do strike me as odd - could you specify what your use case is? Perhaps the data could be nested differently / more efficiently, for example. – Donald Seinen Jan 27 '22 at 16:32
@ctde the `purrr` approach in this case is syntactic sugar, the same can be achieved using `lapply`. `l/vapply` should be [faster](https://stackoverflow.com/questions/42393658/lapply-vs-for-loop-performance-r/70023363#70023363). If speed is a concern however, be careful with nested tibbles, and dplyr, as they are quite greedy for memory and type checks, significant speed up can be achieved using arrays. – Donald Seinen Jan 27 '22 at 16:40
@DonaldSeinen : The data comes from an external server, therefore I have no control how the data is structured. I was baffled as well when I discovered the empty data frames nested within the list. I also can’t figure it out where the empty data frames come from. Most of the times they are on 1st position and sometimes on 2nd or even 3rd. The purpose of this part of the script is to check whether specific numbers in the id column occur. If a certain number is in one of the nested data frames, a variable will be set with “Yes”. If the number doesn’t exist, a variable will be set to “No”. – i-box Jan 27 '22 at 21:56
@DonaldSeinen : I use the base R version you suggested: `Filter(nrow, lapply(sampleList, subset, id == 6))` Most of the times your solution works absolutely fine, because no empty data frames are nested in the `sampleList`. Thank you once again! But if an error occurs it is because of an empty data frame on the 1st or 2nd position. `[[1]] data frame with 0 columns and 0 rows ` or that: `[[1]] id key 1 17 apple orange [[2]] data frame with 0 columns and 0 rows` It would be great if you know how to handle this problem and prevent this kind of error. Thx! – i-box Jan 27 '22 at 21:58
@i-box when I add an empty `data.frame(id = integer(), key = character())` the solution seems to work, but `data.frame(id = NULL, key = NULL)` throws an error. I have added a type check in the answer. – Donald Seinen Jan 28 '22 at 04:58
@DonaldSeinen Thanks for your help! Meanwhile I have solved the handling of the empty data frames. It was a two-step solution. The first line handles the case where `sampleList` contains just an empty df. The second line removes empty dfs from `sampleList` and leaves the normal dfs in place: `if(nrow(sampleList[[1]]) > 0){ sampleList <- sampleList[sapply(sampleList, function(x) dim(x)[1]) > 0] #more code…. }` – i-box Jan 28 '22 at 13:59
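For the use case described in the comments (set a variable to "Yes" or "No" depending on whether an id occurs anywhere in the list), a single pass that tolerates empty data frames at any position might look like the sketch below. The helper name has_id is hypothetical, and the column check by name is an assumption of this sketch rather than something taken from the thread.

```r
df1 <- data.frame(id = as.integer(c(1, 6)), key = c('apple', 'apple.green'), stringsAsFactors = FALSE)
df2 <- data.frame(id = as.integer(c(1, 3, 5)), key = c('apple', 'apple.red', 'apple.red.rotten'), stringsAsFactors = FALSE)
df3 <- data.frame(id = as.integer(c(17)), key = c('orange'), stringsAsFactors = FALSE)

# An empty data frame in first position, as reported in the comments
sampleList <- list(data.frame(), df1, df2, df3)

# has_id() is a hypothetical helper written for this illustration
has_id <- function(dfs, target) {
  found <- vapply(dfs, function(x) {
    # Skip elements without an id column, then test for the target value
    "id" %in% names(x) && any(x$id == target, na.rm = TRUE)
  }, logical(1))
  if (any(found)) "Yes" else "No"
}

has_id(sampleList, 6)
has_id(sampleList, 99)
has_id(list(data.frame()), 6)
```

Unlike the two-step approach, this does not depend on the empty data frame being in the first position.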
