0

I want to read in the first 3 columns from many files where I don't necessarily know the number of columns each file contains. Additionally, I don't exactly know the number of lines to skip in each file, though it won't be more than 19 before the header line.

My question is similar to these questions:

But I have the different problem of not knowing the number of columns in the files I want to import or the exact number of rows to skip. I only want to import the first three columns from every file, which are consistently named (Date/Time,Unit,Value).

The read.table solutions to the linked questions require knowing the number of columns in your file and specifying the colClasses for each column. I am attempting to read thousands of files via an approach with lapply, where the input is a list of .csv files, and use read.table on each file:

lapply(files, read.table, skip=19, header=T, sep=",")
# 2ndary issue: # of lines to skip varies.  maybe up to 19.

Is there a way of getting around the problem of not knowing the number of columns ahead of time?

EDIT: I have modified the answer provided by @asb to suit my problem and it works perfectly.

my.read.table <- function (file, sep=",", colClasses3=c("factor","factor","numeric"), ...) {

## extract the first line of interest, the line where "Date/Time,Unit,Value" appears
first.line <- readLines(file, n=20)[grepl("Date/Time,Unit,Value",
                                          readLines(file, n = 20)) == T]
## deteremine number of lines to skip (max possible = 19)
skip.number <- grep("Date/Time,Unit,Value", 
                    readLines(file, n=20), value=FALSE)-1
## Split the first line on the separator to find # of columns
ncols <- length(strsplit(first.line, sep, fixed=TRUE)[[1]])
## fixed=TRUE to avoid needing to escape the separator.

# use ncols here in the `colClasses` argument
out <- read.table(file, sep=sep, header=TRUE, skip = skip.number,
                  colClasses=c(colClasses3, rep("NULL", ncols - 3)), ...)
out
}
Community
  • 1
  • 1
Jota
  • 17,281
  • 7
  • 63
  • 93
  • 1
    Why not to do something like `lapply(list.files(pattern='som_patt'),full.names=TRUE,function(x)read.table(x)[,1:3])`? I mean you read all the table by you return only some columns. Or Use `cbc.read.table` from `colbycol` as mentioned in answers linked. – agstudy Jun 26 '13 at 22:35

1 Answers1

1

It is easy to know how many columns you have if you know your separator. You can use a construct such as this for each file:

my.read.table <- function (file, sep=",", colClasses3=rep('double', 3), ...) {

  first.line <- readLines(file, n=1)

  ## Split the first line on the separator.

  ncols <- length(strsplit(first.line, sep, fixed=TRUE)[[1]])
  ## fixed=TRUE is to avoid the need to escape the separator when splitting.

  out <- read.table(file, sep=sep,
                    colClasses=c(colClasses3, rep("NULL", ncols - 3)), ...)

  out
}

And then use your solution:

lapply(files, my.read.table, skip=19, header=TRUE)

Also, note that you will have to worry about whether you have rownames and colnames in your file or not because of some intelligence that read.table applies when rownames and colnames are present. The above solution is written assuming none. Please read about colClasses in ?read.table to tweak this further to suit your needs.

asb
  • 4,392
  • 1
  • 20
  • 30
  • I do have column names in the file, which is why I have `header=TRUE`. I'm working on this solution. I modified a few things in the function to suit my purposes, but I am getting the following error when I run it: `Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 1 did not have 13 elements` I can't tell which file is causing the error. – Jota Jun 27 '13 at 00:18
  • 1
    You can figure out the file for which the code is breaking by adding a `print(file)` at the beginning of `my.read.table`. – asb Jun 27 '13 at 00:33
  • 1
    Getting the number of columns with `readLines` can be problematic when you have row names and quoted fields; eg you don't want to read `"Smith, J"` as two columns when it's meant to be one. Better to use `read.table(*, nrows=1)` for this purpose. – Hong Ooi Jun 27 '13 at 01:41
  • @HongOoi Agreed! In fact, I debated the two approaches myself. And then chose this one for the simple fact that it reads a line! :\ – asb Jun 27 '13 at 02:26