106

I am doing a relatively simple piece of analysis that I have put into a function on all the files in a particular folder. I was wondering whether anyone had any tips to help me automate the process on a number of different folders.

  1. Firstly, I was wondering whether there was a way of reading all the files in a particular folder straight into R. I believe the following command will list all the files:

files <- (Sys.glob("*.csv"))

...which I found from Using R to list all files with a specified extension

And then the following code reads all those files into R.

listOfFiles <- lapply(files, function(x) read.table(x, header = FALSE)) 

…from Manipulating multiple files in R

But the files seem to be read in as one continuous list and not individual files… how can I change the script to open all the csv files in a particular folder as individual dataframes?

  1. Secondly, assuming that I can read all the files in separately, how do I complete a function on all these dataframes in one go. For example, I have created four small dataframes so I can illustrate what I want:

     Df.1 <- data.frame(A = c(5,4,7,6,8,4),B = (c(1,5,2,4,9,1)))
     Df.2 <- data.frame(A = c(1:6),B = (c(2,3,4,5,1,1)))
     Df.3 <- data.frame(A = c(4,6,8,0,1,11),B = (c(7,6,5,9,1,15)))
     Df.4 <- data.frame(A = c(4,2,6,8,1,0),B = (c(3,1,9,11,2,16)))
    

I have also made up an example function:

Summary<-function(dfile){
SumA<-sum(dfile$A)
MinA<-min(dfile$A)
MeanA<-mean(dfile$A)
MedianA<-median(dfile$A)
MaxA<-max(dfile$A)

sumB<-sum(dfile$B)
MinB<-min(dfile$B)
MeanB<-mean(dfile$B)
MedianB<-median(dfile$B)
MaxB<-max(dfile$B)

Sum<-c(sumA,sumB)
Min<-c(MinA,MinB)
Mean<-c(MeanA,MeanB)
Median<-c(MedianA,MedianB)
Max<-c(MaxA,MaxB)
rm(sumA,sumB,MinA,MinB,MeanA,MeanB,MedianA,MedianB,MaxA,MaxB)

Label<-c("A","B")
dfile_summary<-data.frame(Label,Sum,Min,Mean,Median,Max)
return(dfile_summary)}

I would ordinarily use the following command to apply the function to each individual dataframe.

Df1.summary<-Summary(dfile)

Is there a way instead of applying the function to all the dataframes, and use the titles of the dataframes in the summary tables (i.e. Df1.summary).

Many thanks,

Katie

AndrewGB
  • 16,126
  • 5
  • 18
  • 49
KT_1
  • 8,194
  • 15
  • 56
  • 68

3 Answers3

129

On the contrary, I do think working with list makes it easy to automate such things.

Here is one solution (I stored your four dataframes in folder temp/).

filenames <- list.files("temp", pattern="*.csv", full.names=TRUE)
ldf <- lapply(filenames, read.csv)
res <- lapply(ldf, summary)
names(res) <- substr(filenames, 6, 30)

It is important to store the full path for your files (as I did with full.names), otherwise you have to paste the working directory, e.g.

filenames <- list.files("temp", pattern="*.csv")
paste("temp", filenames, sep="/")

will work too. Note that I used substr to extract file names while discarding full path.

You can access your summary tables as follows:

> res$`df4.csv`
       A              B        
 Min.   :0.00   Min.   : 1.00  
 1st Qu.:1.25   1st Qu.: 2.25  
 Median :3.00   Median : 6.00  
 Mean   :3.50   Mean   : 7.00  
 3rd Qu.:5.50   3rd Qu.:10.50  
 Max.   :8.00   Max.   :16.00  

If you really want to get individual summary tables, you can extract them afterwards. E.g.,

for (i in 1:length(res))
  assign(paste(paste("df", i, sep=""), "summary", sep="."), res[[i]])
chl
  • 27,771
  • 5
  • 51
  • 71
  • 4
    +1 I would `plyr::llply` (or `ldply`) instead of `lapply` to preserve the names throughout, and define my own summary function, e.g. `plyr::each(min, max, mean, sd, median)` – baptiste Mar 05 '12 at 10:45
  • +1 @chl: thanks for the fullnames trick in list.files function....i forgot it in my answer !!! – dickoa Mar 05 '12 at 10:51
  • @baptiste (+1) Thanks for the `plyr` suggestion. – chl Mar 05 '12 at 11:07
  • Thanks @chl. How do I use the above code with a function that I have written? The example function that I used above ("Summary")with sum, mean, median etc. was just used as an example that I created quickly - the real function that I am using for my actual analysis is much more complex. Any ideas of how I incorporate a more complex function into the above code to give the same individual summary tables? – – KT_1 Mar 05 '12 at 14:59
  • @Katie I guess you can replace `summary` with any function of yours, provided it takes a data.frame as an argument (and/or optional parameters that are constant across the difference DFs). E.g., `lapply(ldf, function(x) apply(x, 2, function(x) c(mean(x), sd(x))))` would return mean and SD computed colwise. – chl Mar 05 '12 at 15:29
  • This works well, but i'd like to see how the `res` object can be converted in a single `dataframe` that contains all records from the files. Can anyone offer a solution? @chl any ideas? – Ben Sep 07 '18 at 13:49
18

usually i don't use for loop in R, but here is my solution using for loops and two packages : plyr and dostats

plyr is on cran and you can download dostats on https://github.com/halpo/dostats (may be using install_github from Hadley devtools package)

Assuming that i have your first two data.frame (Df.1 and Df.2) in csv files, you can do something like this.

require(plyr)
require(dostats)

files <- list.files(pattern = ".csv")


for (i in seq_along(files)) {

    assign(paste("Df", i, sep = "."), read.csv(files[i]))

    assign(paste(paste("Df", i, sep = ""), "summary", sep = "."), 
           ldply(get(paste("Df", i, sep = ".")), dostats, sum, min, mean, median, max))

}

Here is the output

R> Df1.summary
  .id sum min   mean median max
1   A  34   4 5.6667    5.5   8
2   B  22   1 3.6667    3.0   9
R> Df2.summary
  .id sum min   mean median max
1   A  21   1 3.5000    3.5   6
2   B  16   1 2.6667    2.5   5
dickoa
  • 18,217
  • 3
  • 36
  • 50
  • (+1) It looks like we answered quite at the same time and your `plyr` solution is quite nice! – chl Mar 05 '12 at 11:09
  • 1
    Thanks @dickoa for your answers. The function that I made up ("Summary") was poorly described. I was just using it for illustrative purposes - my real function is much more complicated so I was wondering how the above code (and probably my function) could be changed so that it is applied for all the different data frames (and doesn't just use the in built functions in R). – KT_1 Mar 05 '12 at 11:58
8

Here is a tidyverse option that might not the most elegant, but offers some flexibility in terms of what is included in the summary:

library(tidyverse)
dir_path <- '~/path/to/data/directory/'
file_pattern <- 'Df\\.[0-9]\\.csv' # regex pattern to match the file name format

read_dir <- function(dir_path, file_name){
  read_csv(paste0(dir_path, file_name)) %>% 
    mutate(file_name = file_name) %>%                # add the file name as a column              
    gather(variable, value, A:B) %>%                 # convert the data from wide to long
    group_by(file_name, variable) %>% 
    summarize(sum = sum(value, na.rm = TRUE),
              min = min(value, na.rm = TRUE),
              mean = mean(value, na.rm = TRUE),
              median = median(value, na.rm = TRUE),
              max = max(value, na.rm = TRUE))
  }

df_summary <- 
  list.files(dir_path, pattern = file_pattern) %>% 
  map_df(~ read_dir(dir_path, .))

df_summary
# A tibble: 8 x 7
# Groups:   file_name [?]
  file_name variable   sum   min  mean median   max
  <chr>     <chr>    <int> <dbl> <dbl>  <dbl> <dbl>
1 Df.1.csv  A           34     4  5.67    5.5     8
2 Df.1.csv  B           22     1  3.67    3       9
3 Df.2.csv  A           21     1  3.5     3.5     6
4 Df.2.csv  B           16     1  2.67    2.5     5
5 Df.3.csv  A           30     0  5       5      11
6 Df.3.csv  B           43     1  7.17    6.5    15
7 Df.4.csv  A           21     0  3.5     3       8
8 Df.4.csv  B           42     1  7       6      16
sbha
  • 9,802
  • 2
  • 74
  • 62
  • Great solution as it is very flexible. For my data format `read_csv()` did not properly work so I replaced it with `data.table::fread()`. – Thorsten Oct 23 '20 at 09:14