Subsetting dataframes stored in a list

Question

I'm having difficulty figuring out how to subset some specific data from dataframes stored in a list. I've read numerous articles on this site as well as UCLA and Adv-R and I'm just not making any progress.

Advanced-R for Subsetting UCLA Advanced R for Subsetting

My function reads in arguments that help it identify what data I'm interested in pulling out across a range of files. So, dat1, dat2 and dat3 in files 1:15 stored in a directory of files (1:999).

Using lapply and read.csv, I have read all of my files (1:15) into a list of dataframes.

 x <- lapply(directory[id], function(i) {
   read.csv(i, header = TRUE)
 })

An example looks like this via str(x) [of just the first element]:

List of 15
 $ :'data.frame':   1461 obs. of  4 variables:
  ..$ DateObv   : Factor w/ 1461 levels "2003-01-01","2003-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
  ..$ dat1: num [1:1461] NA NA NA NA NA NA NA NA NA NA ...
  ..$ dat2: num [1:1461] NA NA NA NA NA NA NA NA NA NA ...
  ..$ ID     : int [1:1461] 1 1 1 1 1 1 1 1 1 1 ...

So in the argument to my function I want to tell it to give me dat1 from files 1:15, and then I'll take the mean of the results.

I thought maybe I could use another lapply to subset dat1 specifically into a vector, but it keeps returning a NULL value or "list()", or just errors saying the object cannot be subset or that subset is missing an argument. I've tried both subset and bracket notation.

How do you recommend that I take a subset of the list of dataframes so that I get back all dat1's or dat2's into a single vector that I can run a mean against?

Thank you for your time and consideration.

I guess you can make use of something like lapply(x,`[[`,'dat1'), which will return a list of vectors corresponding to the 'dat1' columns from each data frame — Marat Talipov, Feb 11 '15 at 21:35
What exactly is the code you were trying that gave you the error? I would think `unlist(lapply(x, "[[", "dat1"))` might work. An actual [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) would be more useful here than just a description of the structure. — MrFlick, Feb 11 '15 at 21:35
Hello @mrflick here's a sample of observation 1.

 Date        dat1  dat2   ID
 10/11/2003  NA    NA     1
 10/12/2003  5.99  0.428  1
 10/13/2003  NA    NA     1
 10/14/2003  NA    NA     1
 10/15/2003  NA    NA     1
 10/16/2003  NA    NA     1
 10/17/2003  NA    NA     1
 10/18/2003  4.68  1.04   1
 10/19/2003  NA    NA     1
 10/20/2003  NA    NA     1
 10/21/2003  NA    NA     1
 10/22/2003  NA    NA     1
 10/23/2003  NA    NA     1
 10/24/2003  3.47  0.363  1
 10/25/2003  NA    NA     1
 10/26/2003  NA    NA     1
 10/27/2003  NA    NA     1
 10/28/2003  NA    NA     1
 10/29/2003  NA    NA     1
 10/30/2003  2.42  0.507  1

— Zach, Feb 11 '15 at 22:00
@MrFlick I tried your method and it just returns "NULL", which leads me to believe that my dat1 argument isn't actually being used. I know that if I do a simple print(dat1) that it will give me my argument within the function. y <- unlist(lapply(x, "[[", "dat1")) — Zach, Feb 11 '15 at 22:20
What you posted above in the comments is not a reproducible example. Try `dput()`-ing a sample object or build a sample list in your original question. Read the link I provided for other examples. There must be something different going on than what you described. — MrFlick, Feb 11 '15 at 22:23
@MrFlick your answer was the right one but you didn't officially give an answer so I cannot upvote you. I had "dat1" in quotes so the function didn't do anything with it, which is why it returned NULL. Thank you so much for the response, it worked wonderfully! The final code snippet looks like this:

 x <- lapply(directory[id], function(i) {
   read.csv(i, header = TRUE)
 })
 y <- unlist(lapply(x, "[[", dat1))
 output <- mean(y, na.rm = TRUE)
 output

— Zach, Feb 11 '15 at 22:41
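
The fix described in the last comment can be sketched as follows. The data, the function name `mean_of_column`, and the argument name `column` are made up for illustration and are not the asker's actual code. The sketch assumes the function argument holds the column name as a character string. It also shows that `[[` returns NULL when the name matches no column, which would be consistent with the earlier NULL result if the real files have no column literally named dat1.

```r
# Hypothetical stand-in for the list produced by lapply(directory[id], read.csv)
x <- list(
  data.frame(dat1 = c(NA, 5.99, NA), dat2 = c(NA, 0.428, NA), ID = 1L),
  data.frame(dat1 = c(4.68, NA, 3.47), dat2 = c(1.04, NA, 0.363), ID = 2L)
)

# 'column' is a character string naming the column to pull from every data frame
mean_of_column <- function(dfs, column) {
  values <- unlist(lapply(dfs, `[[`, column))  # one long vector across all data frames
  mean(values, na.rm = TRUE)                   # ignore missing observations
}

result <- mean_of_column(x, "dat1")

# A name that matches no column gives NULL for each data frame,
# so unlist() of the whole list is NULL as well
missing_col <- unlist(lapply(x, `[[`, "no_such_column"))
```

Passing the column name through a variable keeps the function reusable for dat2 or any other column without changing its body.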

score 1 · Answer 1 · answered Feb 11 '15 at 21:41

I love plyr for this sort of thing. I would do something like this if you want the mean for each data.frame:

 library(plyr)
 ldply(x, summarize, Mean = mean(dat1, na.rm = TRUE))  # na.rm drops the NA observations

or, if you want a long vector of all the dat1 columns and you want to take the mean of all of them, I'd still use plyr but do this:

 x <- rbind.fill(x)
 mean(x$dat1, na.rm = TRUE)  # na.rm drops the NA observations
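
For comparison, here is a rough base R sketch of the same two calculations without plyr. The list `x` below is made-up data, not the asker's files, and the sketch assumes every data frame has the same columns. Because the real data contains NA values, `mean()` is called with `na.rm = TRUE`.

```r
# Made-up list of data frames with some missing values
x <- list(
  data.frame(dat1 = c(1, NA, 3), dat2 = 10),
  data.frame(dat1 = c(2, 4, NA), dat2 = 10)
)

# One mean per data frame
per_df <- sapply(x, function(d) mean(d$dat1, na.rm = TRUE))

# One mean over all dat1 values pooled together
stacked <- do.call(rbind, x)  # relies on identical column names in every element
pooled <- mean(stacked$dat1, na.rm = TRUE)
```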

score 0 · Answer 2 · answered Feb 11 '15 at 21:49

create a similar data set:

> x = list(data.frame(dat1 = 1:3,dat2=10), data.frame(dat1 = 2:4,dat2=10))
> str(x)
List of 2
 $ :'data.frame':   3 obs. of  2 variables:
  ..$ dat1: int [1:3] 1 2 3
  ..$ dat2: num [1:3] 10 10 10
 $ :'data.frame':   3 obs. of  2 variables:
  ..$ dat1: int [1:3] 2 3 4
  ..$ dat2: num [1:3] 10 10 10

use lapply to select variable dat1:

> lapply(x, function(X) X$dat1)
[[1]]
[1] 1 2 3

[[2]]
[1] 2 3 4

combine the resulting list into a single vector with c (via do.call), call mean on that vector, and add na.rm=TRUE to remove the NA values:

> mean(do.call(c, lapply(x, function(X) X$dat1)),na.rm=TRUE)
[1] 2.5

answered Feb 11 '15 at 21:49

Edzer Pebesma

Hi Edzer, thank you for the feedback. When I do it your way I get the following error: "Warning message: In mean.default(do.call(c, lapply(x, function(X) X$dat1)), : argument is not numeric or logical: returning NA". dat1 is an argument passed into the function that makes sure the function only selects dat1 for vectorization and mean calculation. – Zach Feb 11 '15 at 22:07
There has to be something going on with my function in general because if I try the sample list as provided I can subset it with no issues. The mean function still doesn't work but at least the subsetting does. So that gives me something to explore more. – Zach Feb 11 '15 at 22:31
This warning indicates that some dat1 vectors are not numeric but, for instance, factor, unlike the one you show above. I'd suggest building in a check that catches this. – Edzer Pebesma Feb 12 '15 at 09:43
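
One way such a check might look is sketched below. The data and the helper name `to_numeric` are made up for illustration; the second data frame deliberately stores dat1 as a factor to mimic the situation the warning points to. Converting through `as.character()` is an assumption about how the values should be recovered, and entries that cannot be parsed become NA.

```r
# Made-up data: the second data frame holds dat1 as a factor,
# as can happen when a CSV column contains non-numeric text
x <- list(
  data.frame(dat1 = c(1, 2, NA)),
  data.frame(dat1 = factor(c("3", "4", "n/a")))
)

# Flag which elements have a numeric dat1 column
is_num <- vapply(x, function(d) is.numeric(d$dat1), logical(1))

# Hypothetical helper: leave numeric vectors alone, convert others via character
to_numeric <- function(v) {
  if (is.numeric(v)) v else suppressWarnings(as.numeric(as.character(v)))
}

values <- unlist(lapply(x, function(d) to_numeric(d$dat1)))
result <- mean(values, na.rm = TRUE)
```

Inspecting `is_num` before converting shows which files need attention, which may be preferable to silently coercing everything.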
