I am a beginner in R and have recently transitioned from Stata, so it's been an uphill battle. I was able to write a vectorized command to read multiple CSV files, as discussed in "Sapply vs. Lapply while reading files with factors". Here's my code:

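# "\\.csv$" matches only names ending in .csv; a bare ".csv" is a regex where "." matches any character
filenames <- list.files(path = "~/Documents/R Programming/Data/", pattern = "\\.csv$")
appended_filename <- sapply(filenames, function(x) paste("~/Documents/R Programming/Data/", x, sep = ""))

Merged_file <- do.call(rbind, lapply(appended_filename, read.csv))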

However, I have 50+ files, and I have no way of knowing whether any of them fails to read. Is there a way to print a status such as "1 2 ..." so I can see how many files have been read so far? I am not looking for anything pretty, just an update on what's going on.

I am a beginner, so I am not sure how to add a function that would give me that visibility. As a fallback, I have been manually calling read.csv() on each file to check it, and then rbind(), before running the command above. This is extremely painful.

watchtower

2 Answers

You can use an anonymous function in your lapply call, just as you do in the sapply call above. Inside that function you can print the filename, read the file, and do anything else you want. So instead of applying read.csv directly to each element of appended_filename, you can do something like this:

do.call(rbind, lapply(appended_filename, function(x) {print(x); read.csv(x)}))
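The same anonymous-function idea can be stretched a little to cover the "is there an issue with any of the files" part of the question. The following is only a sketch: `read_with_status` is a hypothetical helper name, and the demo files are created in a temporary directory rather than taken from the asker's data. It prints a running count and uses `tryCatch` so that one unreadable file is recorded instead of stopping the whole run.

```r
# Hypothetical helper (not from the original answer): reads each file,
# prints a running count, and records failures instead of stopping.
read_with_status <- function(paths) {
  n <- length(paths)
  results <- lapply(seq_along(paths), function(i) {
    message(sprintf("Reading file %d of %d: %s", i, n, basename(paths[i])))
    tryCatch(
      read.csv(paths[i]),
      error = function(e) {
        message("  failed: ", conditionMessage(e))
        NULL  # mark this file as unreadable
      }
    )
  })
  ok <- !vapply(results, is.null, logical(1))
  list(
    data   = do.call(rbind, results[ok]),  # only the files that were read
    failed = paths[!ok]                    # paths that raised an error
  )
}

# Self-contained demo: two readable files and one path that does not exist
td <- tempfile("csvdemo")
dir.create(td)
write.csv(head(mtcars, 3), file.path(td, "a.csv"), row.names = FALSE)
write.csv(tail(mtcars, 2), file.path(td, "b.csv"), row.names = FALSE)
paths <- c(
  file.path(td, "a.csv"),
  file.path(td, "missing.csv"),
  file.path(td, "b.csv")
)

out <- read_with_status(paths)
nrow(out$data)
basename(out$failed)
```

`message()` writes to stderr, so the status lines show up immediately even when the result is being assigned to a variable.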

You can also use the function rbind.fill (from the plyr package) to combine a list of data frames. This is a little cleaner than do.call, and it fills any columns that are missing from a file with NA.

library(plyr)
rbind.fill(lapply(appended_filename, function(x) {print(x); read.csv(x)}))
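For readers unsure what do.call is doing in the first version, here is a small base R sketch. The three slices of `mtcars` are stand-ins chosen for illustration, not the asker's files. It compares writing the `rbind` call out by hand with letting `do.call` build the same call from a list.

```r
# Three small data frames standing in for files that were already read in
parts <- list(
  head(mtcars, 2),
  mtcars[10:11, ],
  tail(mtcars, 2)
)

# Writing the call out by hand ...
by_hand <- rbind(parts[[1]], parts[[2]], parts[[3]])

# ... is the call that do.call() constructs from the list,
# and it works for any number of list elements
via_do_call <- do.call(rbind, parts)

# Compare the two results
all.equal(by_hand, via_do_call)
dim(via_do_call)
```

The by-hand form stops being practical once the number of files is large or unknown, which is why the list-based form is used with lapply.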
jdoubleyou
  • This is a fantastic response. A quick question: could you please explain the last part of your response, "You can also use the method rbind.fill (in the plyr library) to combine a list of dataframes. This is a little cleaner than do.call."? I am not quite sure what you mean by "cleaner". I'd sincerely appreciate your thoughts. I am a beginner, so I apologize if this question is too naive. – watchtower Sep 28 '16 at 17:06
  • It's personal preference, but from the docs, 'do.call constructs and executes a function call...'. That makes it a sort of meta-function, which takes the function and the list of arguments as its inputs. With that approach you need two functions to combine the list (do.call and rbind). rbind.fill wraps that specific task (row-binding a list of data frames) into one function, so it is more direct. I think it's worth your time to learn about the functions and philosophy of plyr and dplyr. Try the creator's book, [R for Data Science](http://r4ds.had.co.nz/). – jdoubleyou Sep 28 '16 at 17:53

Progress bars may be a better way to go:

library(purrr)
library(dplyr)

td <- tempdir()

# Make 100 copies of mtcars in a temporary directory
walk(1:100, ~write.csv(mtcars, file.path(td, sprintf("mtcars%02d.csv", .)), row.names=FALSE))

# Get a list of the files. dir() == list.files(), just shorter
fils <- dir(td, pattern="\\.csv$", full.names=TRUE)  # escaped dot and $ so only names ending in .csv match

# Inspect the list
head(fils)
## [1] "/var/folders/3r/zg9pcxys4dqg4j7_bqbn3c0h0000gn/T//RtmpW0AVZ2/mtcars01.csv"
## [2] "/var/folders/3r/zg9pcxys4dqg4j7_bqbn3c0h0000gn/T//RtmpW0AVZ2/mtcars02.csv"
## [3] "/var/folders/3r/zg9pcxys4dqg4j7_bqbn3c0h0000gn/T//RtmpW0AVZ2/mtcars03.csv"
## [4] "/var/folders/3r/zg9pcxys4dqg4j7_bqbn3c0h0000gn/T//RtmpW0AVZ2/mtcars04.csv"
## [5] "/var/folders/3r/zg9pcxys4dqg4j7_bqbn3c0h0000gn/T//RtmpW0AVZ2/mtcars05.csv"
## [6] "/var/folders/3r/zg9pcxys4dqg4j7_bqbn3c0h0000gn/T//RtmpW0AVZ2/mtcars06.csv"

# Use a progress bar based on total # of files to read
pb <- progress_estimated(length(fils))

map_df(fils, function(x) {  # map_df will automagically append all the data frames together
  pb$tick()$print()         # increment the progress bar
  read.csv(x)
}) -> df

# see what we've got
glimpse(df)
## Observations: 3,200
## Variables: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
## $ cyl  <int> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
## $ hp   <int> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
## $ vs   <int> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ am   <int> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ gear <int> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
## $ carb <int> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...

# cleanup those files
walk(fils, unlink)
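
If you would rather not depend on dplyr's progress bar (progress_estimated() is marked as deprecated in newer dplyr releases), roughly the same effect is available in base R with txtProgressBar() from the utils package. This is a sketch of that swapped-in alternative, not the code above; it uses 5 files instead of 100 to keep the demo quick.

```r
# Write a few small CSV files to a fresh temporary directory for the demo
td <- tempfile("pbdemo")
dir.create(td)
for (i in 1:5) {
  write.csv(mtcars, file.path(td, sprintf("mtcars%02d.csv", i)), row.names = FALSE)
}
fils <- list.files(td, pattern = "\\.csv$", full.names = TRUE)

# Base R text progress bar with one step per file
pb <- txtProgressBar(min = 0, max = length(fils), style = 3)
dfs <- lapply(seq_along(fils), function(i) {
  d <- read.csv(fils[i])
  setTxtProgressBar(pb, i)  # advance the bar after each file is read
  d
})
close(pb)

# Append all the data frames together
combined <- do.call(rbind, dfs)
dim(combined)

# Clean up the demo files
unlink(td, recursive = TRUE)
```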
hrbrmstr