
I have two needs, both connected to a dataset similar to the reproducible one below. I have a list of 18 entities, each composed of a list of 17-19 data.frames. Reproducible dataset follows (there are matrices instead of data.frames, but I do not suppose that makes a difference):

test <- list(list(matrix(10:(50-1), ncol = 10), matrix(60:(100-1), ncol = 10), matrix(110:(150-1), ncol = 10)),
             list(matrix(200:(500-1), ncol = 10), matrix(600:(1000-1), ncol = 10), matrix(1100:(1500-1), ncol = 10)))
  1. I need to subset each dataframe/matrix into two parts (by a given number of rows) and save to a new list of lists
  2. Secondly, I need to extract and save a given column(s) out of every data.frame in a list of lists.

I have no idea how to go about doing this other than with for(), but I am sure it should be possible with the apply() family of functions.

Thank you for reading

EDIT:

My expected output would look as follows:

extractedColumns <- list(list(matrix(10:(50-1), ncol = 10)[, 2], matrix(60:(100-1), ncol = 10)[, 2], matrix(110:(150-1), ncol = 10)[, 2]),
                         list(matrix(200:(500-1), ncol = 10)[, 2], matrix(600:(1000-1), ncol = 10)[, 2], matrix(1100:(1500-1), ncol = 10)[, 2]))


numToSubset <- 3
substetFrames <- list(list(list(matrix(10:(50-1), ncol = 10)["first length - numToSubset rows", ], matrix(10:(50-1), ncol = 10)["last numToSubset rows", ]), 
                           list(matrix(60:(100-1), ncol = 10)["first length - numToSubset rows", ], matrix(60:(100-1), ncol = 10)["last numToSubset rows", ]),
                                list(matrix(110:(150-1), ncol = 10)["first length - numToSubset rows", ], matrix(110:(150-1), ncol = 10)["last numToSubset rows", ])),
                      etc...)

It gets to look very messy, hope you can follow what I want.

pun11
  • This looks like a pretty straightforward use of `lapply()`, possibly nested. Would you please update the post with your desired output? – C8H10N4O2 Jan 25 '17 at 15:58
  • Thank you. Edited my question – pun11 Jan 25 '17 at 16:18
  • The desired output is not clear, and it is pseudocode rather than completely runnable code. Which one is it, *extractedColumns* or *substetFrames*? And in #2, you say you want to extract given columns, but for `substetFrames` you are extracting rows, even using the word *rows* in the pseudocode? – Parfait Jan 25 '17 at 17:24
  • @Parfait: Thanks for replying. Actually I think it is quite clear, please do let me know what seems unclear to you. I want two outputs (2 questions): substetFrames is a list in which each frame is split into 2 parts by a given number of rows, and extractedColumns is a list of columns extracted from the original data frames. They correspond to my two questions (albeit in reversed order, if that confuses you) – pun11 Jan 26 '17 at 10:01

2 Answers


You can use two nested lapplys:

lapply(test, function(x) lapply(x, '[', c(2, 3)))

Output:

[[1]]
[[1]][[1]]
[1] 11 12

[[1]][[2]]
[1] 61 62

[[1]][[3]]
[1] 111 112


[[2]]
[[2]][[1]]
[1] 201 202

[[2]][[2]]
[1] 601 602

[[2]][[3]]
[1] 1101 1102

Explanation

The first lapply is applied to the two lists in test. Each of those two lists contains three matrices. The second lapply iterates over those three matrices and subsets each one (that's the '[' function in the second lapply) with the index c(2, 3).

Note: On a matrix, [ with a single index subsets elements 2 and 3 (a matrix is indexed like a vector, column by column), but the same call subsets columns 2 and 3 when used on a data.frame.
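To make the note above concrete, here is a minimal sketch of the difference. The names `m` and `df` are only illustrative and are not part of the question's data.

```r
# Illustrative objects: the first matrix from the question, and the same
# values converted to a data.frame.
m  <- matrix(10:(50 - 1), ncol = 10)
df <- as.data.frame(m)

# Single index on a matrix: picks elements 2 and 3 (column-major order).
m_single <- m[c(2, 3)]
m_single
# [1] 11 12

# Single index on a data.frame: picks columns 2 and 3.
df_single <- df[c(2, 3)]
dim(df_single)
# [1] 4 2

# The explicit [rows, columns] form selects columns 2 and 3 for both.
m_cols  <- m[, c(2, 3)]
df_cols <- df[, c(2, 3)]
```

So if the real data are data.frames, `'['` with `c(2, 3)` gives columns; with matrices, the `[, c(2, 3)]` form is the one that gives columns.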

Subsetting rows and columns

lapply is very flexible with the use of anonymous functions. By changing the code into:

#change rows and columns into what you need
lapply(test, function(x) lapply(x, function(y) y[rows, columns]))

You can specify any combination of rows or columns you want.
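As a rough sketch of how this could cover both parts of the question: the helper `split_rows` below is a hypothetical name introduced here for illustration, and the code assumes every matrix (or data.frame) has more than `numToSubset` rows.

```r
test <- list(
  list(matrix(10:(50 - 1), ncol = 10), matrix(60:(100 - 1), ncol = 10),
       matrix(110:(150 - 1), ncol = 10)),
  list(matrix(200:(500 - 1), ncol = 10), matrix(600:(1000 - 1), ncol = 10),
       matrix(1100:(1500 - 1), ncol = 10))
)
numToSubset <- 3

# Hypothetical helper: split y into its first (n - k) rows and its last k rows.
# drop = FALSE keeps a one-row result as a matrix / data.frame.
split_rows <- function(y, k) {
  n <- nrow(y)
  stopifnot(k >= 1, k < n)
  list(first = y[seq_len(n - k), , drop = FALSE],
       last  = y[(n - k + 1):n, , drop = FALSE])
}

# Question 1: a list of lists of (first, last) pairs.
splitFrames <- lapply(test, function(x) lapply(x, split_rows, k = numToSubset))

# Question 2: column 2 of every matrix, keeping the nested structure.
extractedColumns <- lapply(test, function(x) lapply(x, function(y) y[, 2]))

dim(splitFrames[[1]][[1]]$first)
# [1]  1 10
dim(splitFrames[[1]][[1]]$last)
# [1]  3 10
extractedColumns[[1]][[1]]
# [1] 14 15 16 17
```

Returning a named two-element list per matrix is just one possible layout; it adds one level of nesting, which matches the structure sketched in the question.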

LyzandeR
  • Perfect solution for my second question - extracting given columns of the data frames. Also really appreciate the explanation. I had one more question - how to subset the data frames by rows (e.g. each data frame split in half by rows and saved to a new list of lists of lists of data frames)? Please if you could answer that too, I will be grateful and accept your answer. – pun11 Jan 26 '17 at 10:14
  • Tried this. It works when I use `split()` on a given data frame, but when used in lapply with "x", it throws the following error: `splitFrames <- lapply(test, function(x) split(x, c(rep(1, dim(x)[1] - 3), rep(2, 3)))) Error in rep(1, dim(x)[1] - 3) : invalid 'times' argument` – pun11 Jan 26 '17 at 10:26
  • I have updated the answer. The explanation is exactly the same as the one I wrote previously. Replacing rows with e.g. `1:5` and columns with `c(2,3)`, will extract first 5 rows and 2,3 columns for each of the data.frames. – LyzandeR Jan 26 '17 at 10:29
  • `split` is not the right way to go here. Nested `lapply`s is the right way. – LyzandeR Jan 26 '17 at 10:30
  • Okay, I think I got it. But it is only thanks to you - I have never tried using nested applies. Keep it up please, we learn a lot from people like you! `splitFrames <- lapply(test, function(x) lapply(x, function(y) split(y, c(rep(1, dim(y)[1] - 3), rep(2, 3)))))` – pun11 Jan 26 '17 at 10:35
  • Did not notice the update before I posted. Good to know another solution. – pun11 Jan 26 '17 at 10:36
  • You are very welcome. The R community is growing and people are willing to help others, which is always nice :). If your actual data is not too complicated (as in the data.frames have the same structure) I would avoid the use of `split` as it creates more lists. Simple subsetting with nested `lapply`s should work fine. It takes some practice but you'll get there! – LyzandeR Jan 26 '17 at 10:40
  • I am sorry to disturb, but I still cannot get my head around the notation `lapply(x, '[', c(2, 3))`. How can I think of the '[' part? And is there a similar symbol that does the same thing for rows? – pun11 Jan 26 '17 at 10:42
  • Everything is a function in R, including `[`. `'['(1:5, 1)` is the same as `(1:5)[1]`. You could do `lapply(x, '[', 1:5, c(2, 3))`. `1:5` and `c(2,3)` are arguments passed on to `[`. There are many tutorials on the use of `lapply` that you could benefit from. Just google how to use `lapply`. – LyzandeR Jan 26 '17 at 10:47
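A small sketch of the point made in the last comment, that `[` is an ordinary function whose extra arguments can be passed through `lapply`. The objects `x`, `m` and `lst` are illustrative only.

```r
x <- 1:5

# Calling `[` like any other function; equivalent to x[1].
first_elem <- `[`(x, 1)
first_elem
# [1] 1

# With two index arguments it is the [rows, columns] form: m[1:2, c(2, 3)].
m <- matrix(10:(50 - 1), ncol = 10)
sub_direct <- `[`(m, 1:2, c(2, 3))

# lapply hands every extra argument on to `[`, so these two calls
# do the same thing for each element of the list.
lst <- list(m, m + 100L)
via_bracket   <- lapply(lst, `[`, 1:2, c(2, 3))
via_anonymous <- lapply(lst, function(y) y[1:2, c(2, 3)])

identical(via_bracket, via_anonymous)
# [1] TRUE
```

There is no separate symbol for rows: the first index argument is the row index and the second is the column index. When only one of them is needed, the anonymous-function form (`function(y) y[rows, ]` or `function(y) y[, columns]`) is probably the easiest to read.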

To piggyback on @LyzandeR's answer, consider the often-ignored sibling of the apply family, rapply, which recursively runs a function on the elements of nested lists of vectors/matrices and can return the same nested structure. It is often comparable to nested lapply or its variants, vapply and sapply:

newtest1 <- lapply(test, function(x) lapply(x, '[', c(2, 3)))

newtest2 <- rapply(test, function(x) `[`(x, c(2, 3)), classes="matrix", how="list")

all.equal(newtest1, newtest2)
# [1] TRUE
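As a further sketch along the same lines, the same pattern can pull a whole column (column 2 here) out of every matrix, which is closer to the `extractedColumns` output in the question. The names `cols_lapply` and `cols_rapply` are made up for this example. One caveat, as far as I know: a data.frame is itself a list, so rapply would recurse into its columns rather than treat it as one unit; for real data.frames the nested lapply version is probably the safer choice.

```r
test <- list(
  list(matrix(10:(50 - 1), ncol = 10), matrix(60:(100 - 1), ncol = 10),
       matrix(110:(150 - 1), ncol = 10)),
  list(matrix(200:(500 - 1), ncol = 10), matrix(600:(1000 - 1), ncol = 10),
       matrix(1100:(1500 - 1), ncol = 10))
)

# Nested lapply: column 2 of every matrix.
cols_lapply <- lapply(test, function(x) lapply(x, function(y) y[, 2]))

# rapply with how = "list" keeps the nested list structure;
# the default classes = "ANY" applies the function to every non-list element.
cols_rapply <- rapply(test, function(y) y[, 2], how = "list")

all.equal(cols_lapply, cols_rapply)
# [1] TRUE
```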

Interestingly, to my amazement, rapply runs slower in this use case compared to nested lapply! Hmmmm, back to the lab I go...

library(microbenchmark)

microbenchmark(newtest1 <- lapply(test, function(x) lapply(x, '[', c(2, 3))))    
# Unit: microseconds
#     mean median     uq    max neval
# 31.92804 31.278 32.241 74.587   100

microbenchmark(newtest2 <- rapply(test, function(x) `[`(x, c(2, 3)),
                                        classes="matrix", how="list"))    
# Unit: microseconds
#    min    lq     mean median      uq    max neval
# 69.293 72.18 79.53353 73.143 74.5865 219.91   100

Even more interesting: when the `[` function call is replaced with the equivalent matrix bracket notation, nested lapply runs even better and rapply even worse!

microbenchmark(newtest3 <- lapply(test, function(x) 
                                  lapply(x, function(y) y[c(2, 3), 1])))
# Unit: microseconds
#    min     lq     mean median     uq    max neval
# 26.947 28.391 32.00987 29.354 30.798 100.09   100

all.equal(newtest1, newtest3)
# [1] TRUE

microbenchmark(newtest4 <- rapply(test, function(x) x[c(2,3), 1], 
                                  classes="matrix", how="list"))
# Unit: microseconds
#    min     lq     mean median     uq     max neval
# 74.105 76.752 80.37076 77.955 78.918 203.549   100

all.equal(newtest2, newtest4)
# [1] TRUE
Parfait