0

I am trying to read multiple csvs into R and then subset those csvs by removing columns I don't need using the 'subset' function. i am trying to setup a for loop in r that I can add functions or calculation to a list of csvs in order to produce dataframes for ggplots or stat analysis later. (I currently have tidyverse, dplyr, and ggplot2 installed). Right now I just want to subset the csvs and then create a dataframe from the subsetted data.

I used a for loop to successfully read multiple csvs into separate dataframes by setting a working directory, creating a list of csvs, then reading them into dataframes. This currently outputs a dataframe for each csv named after the original filename:

filenames <- gsub("\\.csv$","", list.files(pattern="\\.csv$"))


for(i in filenames){
     assign(i, read.csv(paste(i, ".csv", sep="")))}

Then I realized I wanted to subset these data before putting them into the dataframes in order to avoid some repetitive code later; however, I am getting an error each time I tried to add a subset function to the for loop. This is what I currently have:

for(i in filenames){
  read.csv(i)
  subset(i, select = c("names", "of columns", "I want"))
  assign(i, read.csv(paste(i, ".csv", sep="")))
}

I receive a "no such file or directory error". I'm sure I'm missing something obvious since my R foundation is poor, but any help or advice to make this work would be appreciated. The subset function has worked for me in the past but I had to write out a new line for each dataframe and would like to avoid that by using a list and for loop or some other method.

Thank you

Kaliber
  • 23
  • 1
  • 5
  • 2
    Using `assign` isn't a great idea. See list alternative: https://stackoverflow.com/questions/11433432/how-to-import-multiple-csv-files-at-once. But here the variable `i` is just the file name. You need to save the result of `read.csv` somewhere or pass that directly to subset. But the `lapply` list strategy is generally going to be easier to work with in the long run. – MrFlick Feb 25 '19 at 19:40
  • Thank you I will try that out and get back to you – Kaliber Feb 25 '19 at 19:45

2 Answers2

2

Apparently, all csv files do have the same structure, i.e., same number and names of columns. Therefore, the suggestion by MrFlick and OP's own answer can be improved in several ways:

  1. The read.csv() function reads all columns. Therefore, a separate subsetting step is required to keep only the wanted columns. The fread() function from the data.table package has a select parameter to read only the wanted columns from file.
  2. rbindlist() is an abbreviation of do.call(rbind, ...) but has an additional parameter idcol. This will create an additional column which identifies the origin of each row.

Create list of data frames

lapply(list.files(pattern = "\\.csv$"), data.table::fread, 
       select = c("names.of", "columns", "I.want"))
[[1]]
   names.of columns I.want
1:        1       2      3

[[2]]
   names.of columns I.want
1:       21      22     23

Note that only selected columns are read from files.

Create one large dataframe

library(data.table)
library(magrittr)   # piping used here to improve readability
lapply(list.files(pattern = "\\.csv$"), fread, select = c("names.of", "columns", "I.want")) %>% 
  rbindlist(idcol = TRUE)
   .id names.of columns I.want
1:   1        1       2      3
2:   2       21      22     23
3:   2       31      32     33

Note that the .id column gives the sequence number of list elements.

Create one large dataframe with originating file names

library(data.table)
library(magrittr)
filenames = list.files(pattern = "\\.csv$")
lapply(filenames, fread, select = c("names.of", "columns", "I.want")) %>% 
  set_names(filenames) %>% 
  rbindlist(idcol = "origin")
      origin names.of columns I.want
1: test1.csv        1       2      3
2: test2.csv       21      22     23
3: test2.csv       31      32     33

Here, set_names() from the magrittr package is used to name the list elements. Then, rbindlist() uses the names of the list elements for the id column.

Sample data

If have created two files.

test1.csv contains one row and four columns:

"names.of", "columns", "I.want", "useless.column"
1, 2, 3, 4

test2.csv contains two rows and five columns:

"names.of", "columns", "I.want", "useless.column", "another.useless.column"
21, 22, 23, 24, 25
31, 32, 33, 34, 35

Note that I have modified the column names to ensure that they are syntactically valid variable names.

Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134
0

I ended up using @MrFlick suggestion and circumvented for loops entirely using lapply and combining all files in one dataframe and subsetting from there. final result below:

filenames = list.files(pattern="*.csv")
filenames

myfiles = do.call(rbind, lapply(filenames, function(x) read.csv(x, stringsAsFactors = FALSE)))

myfiles


myfiles.subset = subset(myfiles, select = c("names of", "columns", "I want")
Kaliber
  • 23
  • 1
  • 5