22

I have a large number of csv files that I want to read into R. All the Column headings in the csvs are the same. At first I thought I would need to create a loop based on the list of file names, but after searching I found a faster way. This reads in and combines all the csvs correctly (as far as i know).

filenames <- list.files(path = ".", pattern = NULL, all.files = FALSE, full.names = FALSE, recursive = FALSE, ignore.case = FALSE)

library(plyr)
import.list <- llply(filenames, read.csv)

combined <- do.call("rbind", import.list)

The only problem is that I want to know which csv a specific row of data comes from. I want a column labeled 'source' that contains the name of the csv that the particular row came from. so for example if the csv was called Chicago_IL.csv when the data got into R the row would look something like this:

> City    State   Market  etc Source  
> Burbank IL      Western etc Chicago_IL
Arndt
  • 223
  • 1
  • 2
  • 4

7 Answers7

27

You have already done all the hard work. With a fairly small modification this should be straight-forward.

The logic is:

  1. Create a small helper function that reads an individual csv and adds a column with the file name.
  2. Call this helper function in llply()

The following should work:

read_csv_filename <- function(filename){
    ret <- read.csv(filename)
    ret$Source <- filename #EDIT
    ret
}

import.list <- ldply(filenames, read_csv_filename)

Note that I have proposed another small improvement to your code: read.csv() returns a data.frame - this means you can use ldply() rather than llply().

Andrie
  • 176,377
  • 47
  • 447
  • 496
  • 3
    using rep is optional, `ret$Source <- filename` would work too – Karsten W. Mar 03 '11 at 22:19
  • @Karsten W, thank you. I have edited the code to reflect your suggestion. – Andrie Mar 03 '11 at 23:30
  • This may be a somewhat silly question but... When using your change (ldply()), the data is reformatted in a way that all the column headings are transfered to row headings and the filenames (with numbers) become the column headings. Is there a way around this? Is there a disadvantage to using llply? – Arndt Mar 04 '11 at 14:56
  • There is no inherent disadvantage in using llply rather than ldply. The only difference is that if you wish to get a data frame as output, ldply does this, so you might save a step. – Andrie Mar 04 '11 at 15:08
  • I see, thanks again, this function works exactly the way that i had hoped. – Arndt Mar 04 '11 at 15:11
  • I wondering how to achieve the above answer, but instead of returning the whole file name, just return the first two letters. With something like subtr()? – derelict Feb 04 '16 at 20:41
16

Try this:

do.call("rbind", sapply(filenames, read.csv, simplify = FALSE))

The row names will indicate the source and line number.

G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
6

Here is a solution using the import_list() function from rio, which is designed exactly for this purpose.

# setup some example files to import
rio::export(mtcars, "mtcars1.csv")
rio::export(mtcars, "mtcars2.csv")
rio::export(mtcars, "mtcars3.csv")

The default behavior of import_list() is to get a list of data frames:

str(rio::import_list(dir(pattern = "mtcars")), 1)
## List of 3
##  $ :'data.frame':       32 obs. of  11 variables:
##  $ :'data.frame':       32 obs. of  11 variables:
##  $ :'data.frame':       32 obs. of  11 variables:

But you can use the rbind argument to instead construct a single data frame (note the _file column at the end):

str(rio::import_list(dir(pattern = "mtcars"), rbind = TRUE))
## 'data.frame':   96 obs. of  12 variables:
##  $ mpg  : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl  : int  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp : num  160 160 108 258 360 ...
##  $ hp   : int  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt   : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec : num  16.5 17 18.6 19.4 17 ...
##  $ vs   : int  0 0 1 1 0 1 0 1 1 1 ...
##  $ am   : int  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear : int  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb : int  4 4 1 1 2 1 4 2 2 4 ...
##  $ _file: chr  "mtcars1.csv" "mtcars1.csv" "mtcars1.csv" "mtcars1.csv" ...

and the rbind_label argument to specify the name of the column that identifies each file:

str(rio::import_list(dir(pattern = "mtcars"), rbind = TRUE, rbind_label = "source"))
## 'data.frame':   96 obs. of  12 variables:
##  $ mpg   : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl   : int  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp  : num  160 160 108 258 360 ...
##  $ hp    : int  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat  : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt    : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec  : num  16.5 17 18.6 19.4 17 ...
##  $ vs    : int  0 0 1 1 0 1 0 1 1 1 ...
##  $ am    : int  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear  : int  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb  : int  4 4 1 1 2 1 4 2 2 4 ...
##  $ source: chr  "mtcars1.csv" "mtcars1.csv" "mtcars1.csv" "mtcars1.csv" ...

For full disclosure: I am the maintainer of rio.

Thomas
  • 43,637
  • 12
  • 109
  • 140
4

data.table solution

Update: here is a complete data.table solution for this, using the keep.rownames. Assuming all your CSVs are in one folder:

library(data.table)
my.path <- "C:/some/path/to/your/folder" #set the path
filenames <- paste(my.path, list.files(path=my.path), sep="/") #list of files

#this will create a rn column with the path in it
my.dt<- data.table(do.call("rbind", sapply(filenames, read.csv,     
                  sep=";")), keep.rownames = T)

Basic syntax solution

I used Grothendieck's solution and added a line to create a column from the row names. As simple as that:

something <- do.call("rbind", sapply(filenames, read.csv, sep=";", simplify = FALSE)) 
something$mycolumn <- row.names(something)

If you only want a part of the filename, replace the 2nd line by something like this:

something$mycolumn <- substring(row.names(something),1,3)

This will use the 1st 3 characters from the filename as the value in the new column.

GuillaumeL
  • 985
  • 8
  • 11
1

Found this one working for me, which creates new column plus merging whole folder csv files.

Using setNames():

file.list <- list.files(pattern = '*.csv')
file.list <- setNames(file.list, file.list)

df.list <- lapply(file.list, read_csv)
df.list <- Map(function(df, name) {
  df$issue <- name
  df
}, df.list, names(df.list))
df <- rbindlist(df.list,use.names = TRUE, fill = TRUE, idcol = "Issue")

This one creates new column of the source file, and merge them.

Jha Ayush
  • 67
  • 8
1

Kinda messy but works:

filenames <- c("foo.csv","bar.csv")
import.list <- list(matrix(,4,4),matrix(6,6))

source <- unlist(sapply(1:length(filenames),function(i)rep(gsub(".csv","",filenames[i]),nrow(import.list[[i]]))))

source
[1] "foo" "foo" "foo" "foo" "bar" "bar" "bar" "bar" "bar" "bar"

combined$source <- source
Sacha Epskamp
  • 46,463
  • 20
  • 113
  • 131
0

Here's a piped solution that works if all csvs have the same number of columns and column names:

library(tidyverse)
library(fs)

# read a directory of csvs into list, add filename as column, bind to df
df <- dir_ls("data/path_to_csvs") %>% 
  map_df(~read_csv(.x) %>% mutate(file = .x)) 

If you want just the base file name and not the entire local path, use basename():

df <- dir_ls("data/path_to_csvs") %>% 
  map_df(~read_csv(.x) %>% mutate(file = basename(.x))) 
Rich Pauloo
  • 7,734
  • 4
  • 37
  • 69