I'm using RStudio 0.99.491 and R version 3.2.3 (2015-12-10). I'm a relative newbie to R, and I'd appreciate some help. I'm working on a project that uses the server logs from an old media server to identify which folders and files on the server are still being accessed and which aren't, so that my team knows which files to migrate. Each log covers a 24-hour period, and I have approximately a year's worth of logs, so in theory I should be able to see all of the access over the past year.

My ideal output is a tree structure or plot that shows which folders on our server are being used. I've figured out how to read one log (one day) into R as a data.frame and then use the data.tree package to turn that into a tree. Now I want to go through all of the log files in the directory, one by one, and append each to that original data.frame before I create the tree. Here's my current code:

#Create the list of log files in the folder
files <- list.files(pattern = "*.log", full.names = TRUE, recursive = FALSE)
#Create a new data.frame to hold the aggregated log data
uridata <- data.frame()
#My function to go through each file, one by one, and add it to the 'uridata' df, above
lapply(files, function(x){
    uriraw <- read.table(x, skip = 3, header = TRUE, stringsAsFactors = FALSE)
    #print(nrow(uriraw)
    uridata <- rbind(uridata, uriraw)
    #print(nrow(uridata))
})

The problem is that, no matter what I try, the value of 'uridata' within the lapply loop seems to not be saved/passed outside of the lapply loop, but is somehow being overwritten each time the loop runs. So instead of getting one big data.frame, I just get the contents of the last 'uriraw' file. (That's why there are those two commented print commands inside the loop; I was testing how many lines there were in the data frames each time the loop ran.)

Can anyone clarify what I'm doing wrong? Again, I'd like one big data.frame at the end that combines the contents of each of the (currently seven) log files in the folder.

John Lynch
  • [here](http://stackoverflow.com/questions/16984529/r-combine-summary-of-multiple-csv-files-into-one-data-frame) or [here](http://stackoverflow.com/questions/30608177/how-to-read-every-csv-file-in-r-and-export-them-into-single-large-file/30608244) could be useful – rawr Apr 27 '16 at 23:53
  • `uridata` stays the same because the assignment inside the function only creates a local variable; R functions don't modify variables in the calling environment as a side effect, which is one of the most important features of functional programming. As @rawr pointed out, you can do something like this instead: `do.call("rbind", lapply(files, read.table, skip = 3, header = TRUE, stringsAsFactors = FALSE))`. Or, if you replace your `lapply` call with a `for` loop, your code will work. – Psidom Apr 28 '16 at 00:00
  • Variables created in a `*apply` function are scoped to that function unless you use `<<-`, which is not usually recommended. The usual strategy is to use `lapply` to make a [list of data.frames](http://stackoverflow.com/questions/17499013/how-do-i-make-a-list-of-data-frames/24376207#24376207) with something like `list_of_data.frames <- lapply(files, function(x){read.table(x, skip = 3, header = TRUE, stringsAsFactors = FALSE)})`, and then combine them into a single data.frame with `do.call(rbind, list_of_data.frames)` if you like. – alistaire Apr 28 '16 at 00:04
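
The scoping behaviour described in these comments can be reproduced without any log files. The following is a minimal sketch, assuming three tiny made-up data frames (`pieces`, with hypothetical `uri` and `hits` columns) in place of the real per-file data:

```r
# Toy stand-ins for the per-file data frames (hypothetical data).
pieces <- list(
  data.frame(uri = "/a", hits = 1L, stringsAsFactors = FALSE),
  data.frame(uri = "/b", hits = 2L, stringsAsFactors = FALSE),
  data.frame(uri = "/c", hits = 3L, stringsAsFactors = FALSE)
)

# 1. Inside the anonymous function, `acc <- ...` creates a new local `acc`.
#    The outer `acc` is never touched, so it stays empty.
acc <- data.frame()
invisible(lapply(pieces, function(p) {
  acc <- rbind(acc, p)
}))
rows_after_lapply <- nrow(acc)    # 0

# 2. A for loop runs in the calling environment, so the same assignment
#    updates the outer variable on every iteration.
acc_loop <- data.frame()
for (p in pieces) {
  acc_loop <- rbind(acc_loop, p)
}
rows_after_loop <- nrow(acc_loop) # 3

# 3. The idiomatic route: combine the list in one step instead of
#    accumulating row by row.
acc_bound <- do.call(rbind, pieces)
```

The `for` loop gives the right answer, but it rebuilds the accumulated data frame on every pass, which is why the list-then-bind pattern is usually preferred.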

3 Answers


do.call() is your friend.

big.list.of.data.frames <- lapply(files, function(x){
    read.table(x, skip = 3, header = TRUE, stringsAsFactors = FALSE)
})

or more concisely (but less-tinkerable):

big.list.of.data.frames <- lapply(files, read.table,
                                  skip = 3, header = TRUE,
                                  stringsAsFactors = FALSE)

Then:

big.data.frame <- do.call(rbind, big.list.of.data.frames)

This is the recommended way to do things because "growing" a data frame dynamically in R is painful: it is slow and memory-expensive, since a new data frame gets built at each iteration.
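
To try the pattern end to end without real server logs, here is a small self-contained sketch. The log layout (three preamble lines, then a `uri hits` header), the file names, and the `demo_logs` folder are all made up for illustration; only the `lapply` plus `do.call(rbind, ...)` steps mirror the code above.

```r
# Write three small fake logs into a temporary folder. The three "#" lines
# stand in for the preamble that `skip = 3` jumps over (hypothetical layout).
log_dir <- file.path(tempdir(), "demo_logs")
dir.create(log_dir, showWarnings = FALSE)
for (day in 1:3) {
  writeLines(
    c("#Software: demo", "#Version: 1.0", "#Date: n/a",
      "uri hits",
      sprintf("/media/folder%d/file.mp4 %d", day, day),
      "/media/shared/intro.mp4 1"),
    file.path(log_dir, sprintf("day%d.log", day))
  )
}

# list.files() treats `pattern` as a regular expression, so "\\.log$"
# matches names that end in ".log".
files <- list.files(log_dir, pattern = "\\.log$", full.names = TRUE)

# Step 1: read every file into its own data frame, collected in a list.
log_list <- lapply(files, read.table, skip = 3, header = TRUE,
                   stringsAsFactors = FALSE)

# Step 2: bind all of them together in a single call.
uridata <- do.call(rbind, log_list)
```

Each fake log has two data rows, so `uridata` ends up with six rows and the columns `uri` and `hits`.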

Jason
  • This solved the problem perfectly. Of COURSE I should have been passing the results of lapply() into a variable at the beginning of the function, instead of trying to save them into a variable within lapply(). And thanks for pointing me towards do.call(). – John Lynch May 05 '16 at 17:25

You can use map_df from the purrr package instead of lapply to get all of the results combined directly into one data frame (map_df relies on dplyr being installed to do the row-binding).

library(purrr)
map_df(files, read.table, skip = 3, header = TRUE, stringsAsFactors = FALSE)
Ricky

Another option is fread from the data.table package, with rbindlist to combine the results:

library(data.table)
rbindlist(lapply(files, fread, skip=3))
akrun