I have hundreds of .csv files I need to read in using fread and save as one data table. The basic structure is the same for each .csv. There is header info that needs to be skipped (easy using skip = ). I am having difficulty with skipping the last line of each .csv file. Each .csv file has a different number of rows.

If I have only one file in the Test folder, this script perfectly skips the first rows (using skip = ) and the last row (using nrows = ):

file <- list.files("Q:/Test/", full.names=TRUE)
all <- fread(file, skip = 7, select = c(1:7,9),
             nrows = length(readLines(file))-9)

When saving multiple files in the Test folder, this is the code I tried:

file <- list.files("Q:/Test/", full.names=TRUE)
L <- lapply(file, fread, skip = 7, select = c(1:7,9),
        nrows = length(readLines(file))-9)
dt <- rbindlist(L)

It doesn't create L and gives me this error:

Error in file(con, "r") : invalid 'description' argument

Any ideas on how to skip the last row of each .csv when each .csv has a different number of rows?

I am using data.table version 1.9.6. Thanks.

FG7
  • Don't use `readLines`, that wastes a lot of effort. Try the approach here: http://stackoverflow.com/questions/3137094/how-to-count-lines-in-a-document – MichaelChirico Apr 11 '16 at 20:31
  • Perhaps `nrows` could accept a negative value to skip lines from the bottom of the file. Filed [#1643](https://github.com/Rdatatable/data.table/issues/1643). – Arun Apr 11 '16 at 21:04
  • Maybe `head -n-1` passed to `fread` directly. Or a `grep -v` to remove the trailing footer text. See section 1 of [this new page](https://github.com/Rdatatable/data.table/wiki/Convenience-features-of-fread). – Matt Dowle Apr 11 '16 at 21:19
  • Also [this answer](http://stackoverflow.com/a/35786076/403310) might help. – Matt Dowle Apr 11 '16 at 21:22
  • @MichaelChirico I like this approach and am trying to work it out. I use RStudio on Windows 7, so I believe I need to use Cygwin. So far I haven't been able to make it work. – FG7 Apr 12 '16 at 16:37
  • @Arun Thanks. It would be a great addition to data.table if not too difficult to implement. – FG7 Apr 12 '16 at 16:39
  • @MattDowle Thank you. From your other linked answers, I need to install Cygwin on my Windows 7 machine. Still working on getting it to work properly. I still only get error messages but I believe the issue is a Cygwin/Windows problem. Your suggestions should work. – FG7 Apr 12 '16 at 16:44
  • @FG7 Did you add the Cygwin bin to your PATH? What are the error messages? I've never heard of persistent problems before, and it's widely used. – Matt Dowle Apr 12 '16 at 18:27
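
A note on the error itself: in the multi-file attempt, `readLines(file)` receives the whole vector of paths, while `readLines` opens a single connection and so needs exactly one path. The sketch below reproduces this with base R only. The two temporary files and their contents are made up for illustration and stand in for the real CSVs.

```r
# Two small temporary files standing in for the CSVs (made-up contents).
tmp <- file.path(tempdir(), c("demo_a.csv", "demo_b.csv"))
writeLines(c("a", "b", "c"), tmp[1])
writeLines(c("a", "b"), tmp[2])

# Passing both paths at once fails, as in the question.
err <- tryCatch(readLines(tmp), error = function(e) conditionMessage(e))

# Counting lines one file at a time works.
n_lines <- vapply(tmp, function(f) length(readLines(f)), integer(1))
```

Here `err` holds the error message rather than file contents, and `n_lines` has one line count per file. It follows that `nrows` has to be computed inside a function that `lapply` calls once per file, not once for the whole vector.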

1 Answer

It's a bit late, but here's what worked for me:

library(data.table)

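# full.names = TRUE returns paths that fread can open from any working directory
fnames <- dir("path", pattern = "\\.csv$", full.names = TRUE)

read_data <- function(z){
  dat <- fread(z, skip = 1, select = 1)
  # Drop the last row; head(dat, -1) also behaves correctly for a one-row table
  head(dat, -1)
}

datalist <- lapply(fnames, read_data)

bigdata <- rbindlist(datalist, use.names = TRUE)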

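Here path is the directory that holds the files. The skip and select values are the ones from my own files, so adjust them to your layout (the question uses skip = 7 and select = c(1:7,9)). I'm assuming the column names are the same in every file; if not, you can always assign new names to bigdata using names. Hope this helps!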
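
As a variation closer to the question's own `nrows` idea, here is a minimal sketch that computes the row count separately for each file. The helper name `read_without_footer`, the `n_header` parameter, and the demo files are hypothetical and not part of the original posts. It assumes each file has `n_header` lines to skip, then one line of column names, then data rows, then exactly one footer line. It still counts lines with `readLines`, which the comments point out is slow for large files.

```r
library(data.table)

# Hypothetical helper (not from the original posts): read one file, skipping
# `n_header` leading lines and leaving out the final footer line.
read_without_footer <- function(path, n_header = 7, ...) {
  n_total <- length(readLines(path))
  # Not data rows: n_header skipped lines, 1 column-name line, 1 footer line.
  fread(path, skip = n_header, nrows = n_total - n_header - 2, ...)
}

# Made-up demo files with different numbers of data rows.
write_demo <- function(path, n_rows) {
  writeLines(c(paste("header line", 1:7), "x,y",
               paste(seq_len(n_rows), letters[seq_len(n_rows)], sep = ","),
               "End of report"), path)
  path
}
files <- c(write_demo(tempfile(fileext = ".csv"), 3),
           write_demo(tempfile(fileext = ".csv"), 5))

# lapply calls the helper once per file, so each file gets its own nrows.
dt <- rbindlist(lapply(files, read_without_footer))
```

With the question's layout, the call would presumably look like `rbindlist(lapply(file, read_without_footer, select = c(1:7, 9)))`, since `n_header = 7` plus the column-name line and the footer line gives the same `- 9` used in the single-file version.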

Gautam