
Use of ldply (package "plyr") to import multiple csv files from a folder: the fate of headers, and how to do it for multiple folders?

set up:
- Desktop: MacBook Pro (Early 2011) with macOS 10.13.6
- Software version: R version 3.5.1 (2018-07-02) -- "Feather Spray"
- R studio: Version 1.1.456

I would like to import multiple csv files from specific folders and merge them into one file with 5 columns: Variable1/Variable2/file_name/experiment_nb/pulse_nb. Based on previous similar questions on Stack Overflow, I have managed to import all files from the same folder into one data.frame. However, I am not sure how to do it for several different folders, nor what the fate of each file's header is after the merge. As the files are too big to handle manually (200 000 lines per file), I want to make sure there is no mistake that would cause all subsequent analysis to fail, such as the header line of an imported csv file ending up among the data.

The csv file names look like this: "20190409-0001_002.csv", with the date, followed by the name of the experiment (0001 in this example), and the number of the pulse (002).

#setting package and directory
library(plyr)
library(stringr)
setwd("/Users/macbook/Desktop/Project_Folder/File_folder1")

#Creating a list of all the filenames: 
filenames <- list.files(path = "/Users/macbook/Desktop/Project_Folder/File_folder1")

#creating a function to read a csv and at the same time add a column with the name of the file
read_csv_filename <- function(filename)
  {
  ret <- read.csv(filename, header=TRUE, sep=",") 
  ret$Source <- filename #EDIT
  ret
}

#importing
import <- ldply(filenames, read_csv_filename)

#making a copy of import
data<-import

#modifying the file name so it removes ".csv" and change the header
data$Source<-str_sub(data$Source, end=-5)
data[1,3]<-"date_expnb_pulsenb"
t<-substr(data[1,3],1,3) 
head(data, n=10)

#create a column with the experiment number, extracted from the file name
data$expnb<-substr(data$Source, 10, 13)
data$expnb<-as.numeric(data$expnb)
head(data, n=10)
tail(data, n=10)

1° Now I need to import all the other folders into the same data.frame. I could eventually do this by hand because the number of folders is manageable (9-10), but I am considering writing code for this as well, for future experiments with a large number of folders. How can I do that? Should I first list all folders, then list all files from those folders, and then regroup them into one file list? Is this doable with list.files? The folder names look like this: "20190409-0001".
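A sketch of one way to do this, assuming the experiment folders ("20190409-0001", "20190409-0002", ...) all sit under the same parent folder (the path below is the one from your question, adjust as needed): `list.files()` can recurse into sub-folders and return full paths in a single call, so no separate folder listing is needed.

```r
# Sketch: list.files() can descend into the experiment sub-folders directly,
# so there is no need to list the folders first.
filenames <- list.files(
  path       = "/Users/macbook/Desktop/Project_Folder",
  pattern    = "\\.csv$",   # keep only csv files
  recursive  = TRUE,        # walk into the "20190409-xxxx" folders
  full.names = TRUE         # return full paths, so read.csv() can open them
)
```

The resulting vector can then be passed to `ldply(filenames, read_csv_filename)` exactly as before; `Source` will contain the full path rather than the bare file name.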

2° The result from the code above (head(data, n=10)) looks like this:


> head(data, n=10)
           Time   Channel.A            Source pulsenb expnb
1          (us)         (A)     expnb_pulsenb      NA    NA
2  -20.00200030 -0.29219970 20190409-0001_002       2     1
3  -20.00100030 -0.29219970 20190409-0001_002       2     1

and


> tail(data, n=10)
                 Time   Channel.A            Source pulsenb expnb
20800511 179.99199405 -0.81815930 20190409-0001_105     105     1
20800512 179.99299405 -0.81815930 20190409-0001_105     105     1

I would like to run extensive data analysis on the now big data.frame, and I am wondering how to check that I do not have some lines with file headers in the middle of it. As the headers are the same in every csv file, does the ldply function already take the headers into account? Would the file headers end up as separate lines in the "import" data frame? How can I check that? (Unfortunately, there are around 200 000 lines in each file, so I cannot really check for headers manually.)
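A sketch of one possible check, using a tiny demo frame standing in for the imported data (the column names follow the output shown above): any row whose `Time` value does not parse as a number is a leftover header or units line and can be inspected or dropped.

```r
# Tiny demo frame standing in for the imported data: row 1 is a stray
# units line like the "(us)"/"(A)" row in the question.
data <- data.frame(Time      = c("(us)", "-20.002", "-20.001"),
                   Channel.A = c("(A)",  "-0.292",  "-0.292"),
                   stringsAsFactors = FALSE)

# Any row whose Time value does not parse as a number is a leftover
# header/units line.
time_num <- suppressWarnings(as.numeric(data$Time))
bad_rows <- which(is.na(time_num))
length(bad_rows)                       # here: 1 suspect row
data     <- data[!is.na(time_num), ]   # drop it once inspected
```

On the real data frame the same two lines (`time_num`/`bad_rows`) tell you how many non-numeric `Time` entries are hiding in the 20-million-row table without any manual scanning.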

I hope I have added all the required details and put the questions in the right format as it is my first time posting here :)

Thank you guys in advance for your help!

MDEWITT
ArthurR
  • Is there a way to determine the names of the folders? Do they share a common string/name that other folders do not have? Or perhaps their names can be constructed programmatically? – Wimpel Jun 03 '19 at 09:20
  • further: you ask a lot of questions in one 'question'. Perhaps it is better to split this post into different 'single' questions, each with their own reproducible example. – Wimpel Jun 03 '19 at 09:22

2 Answers


I have created a mock environment of folders and files, assuming that you have logically regrouped all your files and folders under one parent folder.

# ---
# set up folders and data
lapply( as.list(paste0("iris", 1:3)), dir.create )

iris_write <- function(name) write.csv(x = iris, file = name)
lapply( as.list(paste0("iris", 1:3, "/iris", 1:3, ".csv")), iris_write)

# Supposing you got them all in one folder, one level up
ldir <- list.dirs()
ldir <- ldir[stringr::str_detect(string = ldir, pattern = "iris")] # use 20190409-0001 in your case

# Getting all files
lfiles <- lapply( as.list(ldir), list.files )

# Getting all paths
path_fun <- function(dirname) paste0(dirname, "/", list.files(dirname) )
lpath <- lapply( as.list(ldir), path_fun )

Using base R and/or the package data.table

# ---
# --- Import, with functions that detect automatically headers, sep + are way faster to read data

# *** Using data.table

library(data.table)
read_csv_filename <- function(filename){  
        ret <- fread(filename)
        ret$Source <- filename #EDIT
        ret
}

ldata <- lapply( unlist(lpath), read_csv_filename ) # unlist so each element is a single file path, even with several files per folder

# --- if you want to regroup them
# with r base
df_final <- do.call("rbind", ldata)
# using data.table
df_final <- rbindlist(ldata)
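For what it's worth, the list/read/bind steps can also be collapsed into one pass; `rbindlist(..., idcol = )` records which file each row came from, so the `Source` column comes for free. A self-contained sketch with two throwaway files standing in for the real data:

```r
library(data.table)

# Two throwaway csv files standing in for the real data
d <- tempfile("demo"); dir.create(d)
fwrite(data.frame(Time = 1:2, Channel.A = c(0.1, 0.2)), file.path(d, "a.csv"))
fwrite(data.frame(Time = 3:4, Channel.A = c(0.3, 0.4)), file.path(d, "b.csv"))

# Read + bind in one pass; idcol = "Source" keeps the file of origin,
# taken from the names of the list (here the bare file names)
paths <- list.files(d, pattern = "\\.csv$", full.names = TRUE)
names(paths) <- basename(paths)
df_final <- rbindlist(lapply(paths, fread), idcol = "Source")
```

This avoids the separate `read_csv_filename` helper entirely when all you need from the file name is the `Source` column.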

Using package dplyr

# *** using dplyr
library(dplyr)
read_csv_filename2 <- function(filename){  
        ret <- readr::read_csv(filename)
        ret$Source <- filename #EDIT
        ret
}

ldata <- lapply( unlist(lpath), read_csv_filename2 )
df_final <- bind_rows(ldata)

# you may do this with plyr::ldply also
df_final2 <- plyr::ldply(ldata, data.frame)

# *** END loading

Last suggestion: `file_path_sans_ext` from the package tools

# modifying the file name so it removes ".csv" and change the header
library(tools)
data$Source <- tools::file_path_sans_ext( data$Source )

#create a column with the experiment number, extracted from the file name
data$expnb <- substr(data$Source, 10, 13)
data$expnb <- as.numeric(data$expnb)
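One more alternative to the fixed-position `substr()` calls, a sketch only: a regex keeps working even if the date or experiment part ever changes width. The file names below are the ones from the question.

```r
# Regex alternative to substr(): capture the digits around "_" in
# names like "20190409-0001_002" (date-expnb_pulsenb)
src     <- c("20190409-0001_002", "20190409-0001_105")
expnb   <- as.numeric(sub("^\\d+-(\\d+)_\\d+$", "\\1", src))
pulsenb <- as.numeric(sub("^\\d+-\\d+_(\\d+)$", "\\1", src))
expnb     # 1 1
pulsenb   # 2 105
```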

Hope this helps :)

cbo
  • FYI, you can do a lot of the above steps in one go. See this https://stackoverflow.com/a/48105838/786542 – Tung Dec 19 '19 at 22:38
  • Thanks for the tip on your nice answer @Tung! Btw, are the times in your barplot given as a one-time experiment or are they means of a microbenchmark? I see Dirk Eddelbuettel also provided an answer, I will look into that also. – cbo Dec 20 '19 at 14:16
  • The benchmark was done by the `vroom` package owner. The code is here https://github.com/r-lib/vroom/tree/master/inst/bench – Tung Dec 20 '19 at 15:11

I'll add my solution too, using purrr's `map_dfr`.

Generate Data

This will just generate a lot of csv files in a temp directory for us to manipulate. This is a good approach for helping us answer questions for you.

library(tidyverse)
library(fs)

temp_directory <- tempdir()

library(nycflights13)
purrr::iwalk(
  split(flights, flights$carrier),
  ~ { str(.x$carrier[[1]]); vroom::vroom_write(.x, paste0(temp_directory,"/", glue::glue("flights_{.y}.csv")), 
                                               delim = ",") }
)

Custom Function

It looks like you have a custom function to read in some information because the format of the file might be different. Here's my hack at what you were doing.

# List of files
my_files <- fs::dir_ls(temp_directory, glob = "*.csv")

custom_read_csv <- function(file){
  # Read without colnames
  ret <- read_csv(file, col_names = FALSE)

  # Pull out column names
  my_colnames <- unlist(ret[1,])

  # Remove the row
  ret <- ret[-1,]

  # Assign the column names
  colnames(ret) <- my_colnames

  # Trick to remove leftover header rows: keep only the rows where the
  # first column (which should be numeric, e.g. Time) parses as a number
  ret <- filter(ret, !is.na(suppressWarnings(as.numeric(ret[[1]]))))

  ret
}

Now you can read in all of your files with the custom function and combine into a single dataframe using map_dfr:

all_files <- map_dfr(my_files, custom_read_csv, .id = "filename")

head(all_files)

Which looks like this:

> head(all_files)
# A tibble: 6 x 20
  filename year  month day   dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum
  <chr>    <chr> <chr> <chr> <chr>    <chr>          <chr>     <chr>    <chr>          <chr>     <chr>   <chr>  <chr>  
1 C:/User~ 2013  1     1     810      810            0         1048     1037           11        9E      3538   N915XJ 
2 C:/User~ 2013  1     1     1451     1500           -9        1634     1636           -2        9E      4105   N8444F 
3 C:/User~ 2013  1     1     1452     1455           -3        1637     1639           -2        9E      3295   N920XJ 
4 C:/User~ 2013  1     1     1454     1500           -6        1635     1636           -1        9E      3843   N8409N 
5 C:/User~ 2013  1     1     1507     1515           -8        1651     1656           -5        9E      3792   N8631E

Then you could remove the root path using the following syntax (my path is in there now):

all_files %>% 
  mutate(filename = str_remove(filename, "C:/Users/AppData/Local/Temp/RtmpkdmJCE/"))
MDEWITT