
I want to construct a data frame by reading in a csv file for each day in the month. My daily csv files contain columns of characters, doubles, and integers, all with the same number of rows. I know the maximum number of rows for any given month, and the number of columns is the same for each csv file. I loop through each day of a month with fileListing, which contains the list of csv file names (say, for January):

output <- matrix(ncol = 18, nrow = 2976)
for (i in 1:length(fileListing)) {
    df <- read.csv(fileListing[i], header = FALSE, sep = ',',
                   stringsAsFactors = FALSE, row.names = NULL)
    # each df is a data frame with 96 rows and 18 columns

    # now insert the data from the ith date into the 96 rows for that day
    rows <- ((i - 1) * 96 + 1):(i * 96)
    for (j in 1:18) {
        output[rows, j] <- df[[j]]
    }
}

Sorry for having revised my question as I figured out part of it (duh), but should I use rbind to progressively insert data at the bottom of the data frame, or is that slow?

Thank you.

BSL

Benjamin Levy

3 Answers

You can read them into a list with lapply, then combine them all at once:

data <- lapply(fileListing, read.csv, header = FALSE, stringsAsFactors = FALSE, row.names = NULL)
df <- do.call(rbind.data.frame, data)
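
If the number of files grows, one common alternative for the combine step (a sketch, not part of the original answer, and assuming the data.table package is installed) is rbindlist, which binds a list of data frames with less copying:

library(data.table)

df <- rbindlist(data)     # combine the same 'data' list built above
df <- as.data.frame(df)   # optional: drop the data.table class for a plain data frame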
Matthew Lundberg

First define a master data frame to hold all of the data. Then, as each file is read, append its data onto the master.

masterdf <- data.frame()
for (i in 1:length(fileListing)) {
  df <- read.csv(fileListing[i], header = FALSE, sep = ',',
                 stringsAsFactors = FALSE, row.names = NULL)
  # each df is a data frame with 96 rows and 18 columns
  masterdf <- rbind(masterdf, df)
}

At the end of the loop, masterdf will contain all of the data. This code can be improved, but for a dataset of this size it should be quick enough.
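
One common improvement, shown below as a sketch rather than part of the original answer, is to collect the pieces in a list and bind them once at the end, which avoids copying masterdf on every iteration:

pieces <- vector("list", length(fileListing))
for (i in seq_along(fileListing)) {
  pieces[[i]] <- read.csv(fileListing[i], header = FALSE, sep = ',',
                          stringsAsFactors = FALSE, row.names = NULL)
}
masterdf <- do.call(rbind, pieces)   # one rbind instead of one per file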

Dave2e

If the data is fairly small relative to your available memory, just read the data in and don't worry about it. After you have read in all the data and done some cleaning, save the result using save() and have your analysis scripts read that file back in using load(). Separating reading/cleaning scripts from analysis scripts is a good way to reduce this problem.
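
As a sketch of that split (the file name january_clean.RData is just an example):

# reading/cleaning script: read and combine everything, then save the result
monthly <- do.call(rbind, lapply(fileListing, read.csv, header = FALSE,
                                 stringsAsFactors = FALSE, row.names = NULL))
save(monthly, file = "january_clean.RData")

# analysis script: load the prepared object instead of re-reading the csv files
load("january_clean.RData")   # restores the object named 'monthly'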

One way to speed up read.csv is to use the nrows and colClasses arguments. Since you say you know the number of rows in each file, telling R this will help speed up the reading. You can extract the column classes using

colClasses <- sapply(read.csv(file, nrows = 100), class)

then pass the result to the colClasses argument.
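
Putting the two arguments together might look something like this sketch (the 96 rows per file and the shared column layout are taken from the question; the first file stands in for the placeholder file above):

# learn the column classes once, then reuse them for every file
colClasses <- sapply(read.csv(fileListing[1], nrows = 100, header = FALSE,
                              stringsAsFactors = FALSE), class)

data <- lapply(fileListing, read.csv, header = FALSE, stringsAsFactors = FALSE,
               colClasses = colClasses, nrows = 96)
monthly <- do.call(rbind, data)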

If the data is getting close to being too large for memory, you may consider processing individual files and saving intermediate versions. There are a number of related discussions on this site about managing memory that cover this topic.

On memory usage tricks: Tricks to manage the available memory in an R session

On using the garbage collector function: Forcing garbage collection to run in R with the gc() command

lmo
  • I thought of that step, but I still want to write the monthly collection of daily files, such that the 2nd day is appended to the bottom of the 1st day of data in the monthly data frame. Thanks. – Benjamin Levy Apr 06 '16 at 20:46
  • Put in a couple edits regarding the colClasses and nrows arguments. These will help with read time and memory usage. Using rbind will be fast on moderately sized datasets. – lmo Apr 06 '16 at 20:54