3

I have to read multiple CSV files (each more than 120 MB). I use a for loop, but it is very, very slow. How can I read the CSV files more quickly?

My code:

H = data.frame()
# codes to look for in any of the three diagnosis columns
I = c("051", "041", "044", "54", "V0262")
for (i in 201:225){
    for (j in 1996:2007){
        filename = paste("D:/Hannah/CD/CD.R", i, "_cd", j, ".csv", sep="")
        x = read.csv(filename, stringsAsFactors = FALSE)
        # keep only the rows where A_1, A_2 or A_3 matches one of the codes
        temp = x[(x$A_1 %in% I) | (x$A_2 %in% I) | (x$A_3 %in% I), ]
        H = rbind(H, temp)
    }
}

Each file has the same structure, like this:

> str(x)
'data.frame':   417691 obs. of  37 variables:
$ YM: int  199604 199612 199612 199612 199606 199606 199609 199601 ...
$ A_TYPE: int  1 1 1 1 1 1 1 1 1 1 ...
$ HOSP: chr  "dd0516ed3e" "c53d67027e" ...
$ A_DATE: int  19960505 19970116 19970108  ...
$ C_TYPE: int  19 9 1 1 2 9 9 1 1 1 ...
$ S_NO : int  142 37974 4580 4579 833 6846 2272 667 447 211 ...
$ C_ITEM_1 : chr  "P2" "P3" "A2"...
$ C_ITEM_2 : chr  "R6" "I3" ""...
$ C_ITEM_3 : chr  "W2" "" "A2"...
$ C_ITEM_4 : chr  "Y1" "O3" ""...
$ F_TYPE: chr  "40" "02" "02" "02" ...
$ F_DATE : int  19960415 19961223 19961227  ...
$ T_END_DATE: int  NA NA NA  ...
$ ID_B : int  19630526 19630526 19630526  ...
$ ID : chr  "fff" "fac" "eab"...
$ CAR_NO : chr  "B4" "B5" "C1" "B6" ...
$ GE_KI: int  4 4 4 4 4 4 4 4 4 4 ...
$ PT_N : chr  "H10" "A10" "D10" "D10" ...
$ A_1  : chr  "0521" "7948" "A310" "A312" ...
$ A_2  : chr  "05235" "5354" "" "" ...
$ A_3  : chr  "" "" "" "" ...
$ I_O_CE: chr  "5210" "" "" "" ...
$ DR_DAY : int  0 7 3 3 0 0 3 3 3 3 ...
$ M_TYPE: int  2 0 0 0 2 2 0 0 0 0 ...

........

Vivian
  • 309
  • 1
  • 10
  • So you have 25 .csv files where you need to extract some information into one data.frame? Are all these files consistent and identical in their structure? Which OS are you using? – vaettchen Aug 18 '13 at 05:26
  • @vaettchen Yes, 25 .csv files for each year, so 25*12 (years) = 300 files. All the files have a consistent structure. I work on Windows 7. – Vivian Aug 18 '13 at 05:31
  • Can you add sample data, with `dput( head( x, 3 ) )`? If the data is simply structured, using RSQLite could be an option. Or have a look at the package `sqldf`. – vaettchen Aug 18 '13 at 05:47
  • 1
    You may have a look here: http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r – Henrik Aug 18 '13 at 06:01

3 Answers

4

I think the big performance problem here is that you iteratively grow the H object. Each time the object grows, the OS needs to allocate more memory for it, and this process takes quite a long time. A simple fix would be to preallocate H to the correct number of rows. If the number of rows is not known beforehand, you can preallocate a good amount and resize as needed; see the sketch below.
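As an illustration, a minimal sketch of a list-based variant of that idea, reusing the loop bounds and file naming from the question: each filtered piece goes into a preallocated list slot, and a single rbind happens at the end.

pieces = vector("list", length(201:225) * length(1996:2007))  # one slot per file
I = c("051", "041", "044", "54", "V0262")
k = 1
for (i in 201:225){
    for (j in 1996:2007){
        filename = paste("D:/Hannah/CD/CD.R", i, "_cd", j, ".csv", sep="")
        x = read.csv(filename, stringsAsFactors = FALSE)
        # store only the matching rows; the slot is already allocated
        pieces[[k]] = x[(x$A_1 %in% I) | (x$A_2 %in% I) | (x$A_3 %in% I), ]
        k = k + 1
    }
}
# one single rbind instead of 300 incremental ones
H = do.call(rbind, pieces)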

Alternatively, the following approach does not suffer from the problem I describe above:

list_of_files = list.files('dir_where_files_are', pattern = '\\.csv$', full.names = TRUE)
big_data_frame = do.call('rbind', lapply(list_of_files, read.csv, stringsAsFactors = FALSE))
Paul Hiemstra
  • 59,984
  • 12
  • 142
  • 149
  • Yes, but I run this on my notebook, so memory is short; I can't rbind all the files. – Vivian Aug 18 '13 at 07:42
  • I think that you should filter the data right after reading to decrease its size: `lapply(list_of_files, function(fn, v=c("051", "041", "044", "54", "V0262")) subset(read.csv(fn, as.is=TRUE), (A_1 %in% v)|(A_2 %in% v)|(A_3 %in% v)))` – Wojciech Sobala Aug 18 '13 at 13:34
  • Subsetting to just what you need is a good idea indeed. Do note that using `subset` is discouraged outside an interactive session. – Paul Hiemstra Aug 18 '13 at 14:45
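Picking up the two comments above, a minimal sketch that filters each file right after reading and uses plain logical indexing instead of `subset` (the directory path is a placeholder):

keep = c("051", "041", "044", "54", "V0262")
list_of_files = list.files("D:/Hannah/CD", pattern = "\\.csv$", full.names = TRUE)

# plain logical indexing works the same inside functions, unlike subset()
read_and_filter = function(fn, v = keep){
    x = read.csv(fn, stringsAsFactors = FALSE)
    x[(x$A_1 %in% v) | (x$A_2 %in% v) | (x$A_3 %in% v), ]
}

big_data_frame = do.call(rbind, lapply(list_of_files, read_and_filter))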
2

This may not be the most efficient or most elegant approach, but here is what I would do, based upon some assumptions, since more info is missing; in particular, I can't do any testing:

Make sure that RSQLite is installed (sqldf could be an option if you have enough memory, but personally I prefer having a "real" database that I can also access with other tools).

# make sqlite available
library( RSQLite )
db <- dbConnect( dbDriver("SQLite"), dbname = "hannah.sqlite" )

# create a vector with your filenames
filenames <- NULL
for (i in 201:225)
{
    for ( j in 1996:2007 )
    {
        fname <- paste( "D:/Hannah/CD/CD.R", i, "_cd", j, ".csv", sep="" ) 
        filenames <- c( filenames, fname )
    }
}

# extract the DB structure, create empty table
# note: "all" is a reserved word in SQL, so it must be double-quoted in statements
x <- read.csv( filenames[1], stringsAsFactors = FALSE, nrows = 1 )
dbWriteTable( db, "all", x, row.names = FALSE )
dbGetQuery( db, 'DELETE FROM "all"' )

# a small table for your selection criteria (build in flexibility for the future)
# a small table for your selection criteria (build in flexibility for the future)
# name the column explicitly so the subqueries below can refer to it as I
I <- data.frame( I = c( "051", "041", "044", "54", "V0262" ) )
dbWriteTable( db, "crit", I, row.names = FALSE )

# move your 300 .csv files into that table
# (you probably do that better using the sqlite CLI but more info would be needed)
for( f in filenames )
{
    x <- read.csv( f, stringsAsFactors = FALSE )
    dbWriteTable( db, "all", x, append = TRUE, row.names = FALSE )
}

# now you can extract the subset in one go
extract <- dbGetQuery( db, 'SELECT * FROM "all" 
                       WHERE A_1 IN (SELECT I FROM crit ) OR
                             A_2 IN (SELECT I FROM crit ) OR
                             A_3 IN (SELECT I FROM crit )'   )
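When the extract is done, it is good practice to release the connection; a one-liner, using the `db` handle from above:

# close the database connection when finished
dbDisconnect( db )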

This is not tested but should work (if not, tell me where it stops), and it should be faster and not run into memory problems. But again, without real data, no real solution!

vaettchen
  • 7,299
  • 22
  • 41
  • `dbWriteTable( db, "all", x )` returns `[1] FALSE` with warning `RS-DBI driver: (error in statement: near "all": syntax error)`, and `dbGetQuery( db, "DELETE FROM all" )` gives `Error in sqliteExecStatement(con, statement, bind.data) : RS-DBI driver: (error in statement: near "all": syntax error)` – Vivian Aug 18 '13 at 08:23
  • `sqldf` uses memory by default, but by specifying the name of a database via the `dbname` argument of `sqldf`, it will use an external database. Also note that `read.csv.sql` in the sqldf package uses a temporary external database by default rather than memory. – G. Grothendieck Aug 18 '13 at 09:35
  • Would need real data, ideally a `dput( x )`, to find out what is going wrong. What about a pastebin link or so? – vaettchen Aug 18 '13 at 10:52
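Following up on the `sqldf` pointer in the comments, a hedged sketch of filtering a single file while reading it with `read.csv.sql`, so the full file is staged in a temporary SQLite database instead of R memory (inside the `sql` statement the file is referred to as `file`; the path is one of the question's filenames):

library(sqldf)

# read only the matching rows of one file
temp = read.csv.sql("D:/Hannah/CD/CD.R201_cd1996.csv",
                    sql = "SELECT * FROM file
                           WHERE A_1 IN ('051','041','044','54','V0262')
                              OR A_2 IN ('051','041','044','54','V0262')
                              OR A_3 IN ('051','041','044','54','V0262')")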
2

You could also use the function `fread()` from the data.table package. It is very fast compared to `read.csv`. Also, try to just loop over `list.files()`; a sketch follows below.
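A minimal sketch of that combination, assuming all 300 files sit in one directory (the path is a placeholder); `rbindlist()` stacks the filtered pieces much faster than repeated `rbind()` calls:

library(data.table)

files = list.files("D:/Hannah/CD", pattern = "\\.csv$", full.names = TRUE)
codes = c("051", "041", "044", "54", "V0262")

# fread each file, keep only the matching rows, then stack everything once
H = rbindlist(lapply(files, function(f){
    x = fread(f, header = TRUE)
    x[A_1 %in% codes | A_2 %in% codes | A_3 %in% codes]
}))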

beginneR
  • 3,207
  • 5
  • 30
  • 52
  • `x=fread(filename, header = TRUE, stringsAsFactors=FALSE, sep="," )` doesn't work for me. – Vivian Aug 18 '13 at 11:13
  • `H=data.frame(); for (i in 201:202){ for (j in 1996:1996){ filename=paste("D:/Hannah/CD/CD.R",i,"_cd",j,".csv", sep=""); x=fread(filename, header = FALSE, stringsAsFactors=FALSE); H=rbind(H,x) } }` I tried it with two files. It works and is faster than the old way, but I don't know how to resolve the problem that the column names become V1, V2, ... while row one holds the true names. – Vivian Aug 18 '13 at 15:18