
The following is a legitimate question, because neither "consolidating data frames in R" nor "How to make a great R reproducible example?" answers it.

I have a dataset split across multiple csv files without headers. For a single import, I use:

X <- read_delim( ... ,
                 ... ,
                 col_types = cols(   X1 = "c" ,
                                     ...      ,
                                   X100 = "i" )
               )

To import all, I simply repeat the above.

I'd like to shorten the code, though.

Is it possible to supply the column definitions to read_delim() by defining them only once? I've tried supplying them as a c() vector, but it doesn't work.

2 Answers


A solution with lapply():

You can set the working directory to the folder containing your files and then build a list of paths for all of the .csv files in that directory. Finally, you can use lapply() to apply read.csv() over that list of file paths. read.csv() is a natural fit here because the files are CSVs. You set colClasses once in the call inside lapply(), and every .csv file in your working directory is read with the same column types.

See ?lapply for the documentation.
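
In case *lapply* is unfamiliar: it applies a function to each element of a vector or list and returns a list. A minimal illustration (my own, not part of the original answer):

squares <- lapply(1:3, function(i) i^2)  # returns list(1, 4, 9)
str(squares)
# List of 3
#  $ : num 1
#  $ : num 4
#  $ : num 9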

You can try something like this:

setwd( "C:/path/to/directory/containing/files/here/")   

file.paths <- list.files(pattern = '.csv')

column_classes <- c("character", "numeric", "numeric") # specify for all columns   

my.files <- lapply(file.paths, function(x) read.csv(x, colClasses= column_classes))
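
Since lapply() returns a list of data frames, a possible follow-up (my sketch, not part of the original answer) is to stack them into a single data frame, assuming every file has the same columns:

# Row-bind the list of data frames into one (columns must match across files)
combined <- do.call(rbind, my.files)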
Dodge
  • Highly elegant! I've used *read_delim* because it's the code RStudio prints to the console after using the point-and-click import button. Total n00b! Will have to find out what *lapply* does; haven't encountered it yet. – Perry's Mar 21 '18 at 05:56
  • @Perry's There was an error in my code, but I've edited my answer. My apologies. – Dodge Mar 21 '18 at 06:38

If you want to write great code, which it seems you do, you shouldn't repeat yourself. What if you get handed another 100 csv files? You won't want to change your code every time. So don't just copy and paste your lines of code when you want to do something multiple times.

Don't repeat yourself

I think the best way here is to define a custom function that reads a file with the parameters you have been using. Then get a list of all the files you want to read; this can be typed manually, or you can use something like list.files() to get the names of the files in a directory. Finally, use lapply() or purrr::map() to apply your custom function to each of those filenames.

library(readr)
library(purrr)

read_my_file <- function(filename){
  read_delim( filename ,
              ... ,
              col_types = cols(   X1 = "c" ,
                                  ...      ,
                                X100 = "i" )
  )
}


filenames <- c("one.csv", "two.csv", "three.csv")

dataframes <- map(filenames, read_my_file)

If you want to then concatenate all the dataframes (by rows) into one large one, use map_dfr in place of map.
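
A minimal sketch of that variant, reusing the hypothetical filenames above:

# map_dfr() applies read_my_file to each path and row-binds the results
big_df <- map_dfr(filenames, read_my_file)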

Jack Brookes
  • I think this can be done without any libraries or custom functions. – Dodge Mar 20 '18 at 13:52
  • Yes, a function would handle a lot of the problems I described. A core problem, however, was repetitively assigning the same column types. In the *readr* version history I found out that you can define a *list(X1="c" ... X100 = "i")* and use that list, instead of *c()*, as the column definition. With that, I'm glad to take your function scaffold, too. I ignored functions before, because I'm a n00b and loading data in any way was an achievement. After all, I haven't used some pre-made mtcars dataset, but spent weeks merely acquiring the data ... – Perry's Mar 21 '18 at 05:52
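
To make the trick from this last comment concrete, here is a minimal sketch (the file names and the ";" delimiter are assumptions; cols() is readr's documented way to build a reusable specification, and only two of the hundred columns are shown):

library(readr)

# Define the column specification once. X1/X100 are the default names
# readr assigns when col_names = FALSE; the remaining columns are elided.
my_spec <- cols(X1 = col_character(), X100 = col_integer())

# Reuse the same spec for every headerless file
a <- read_delim("one.csv", delim = ";", col_names = FALSE, col_types = my_spec)
b <- read_delim("two.csv", delim = ";", col_names = FALSE, col_types = my_spec)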