fread apply select and colClasses after check.names?

Question

I'm trying to read in a large amount of data (up to 100 files with sizes up to 1.5GB each) which are in a slightly annoying format, and each vary slightly. For speed reasons I want to use data.table::fread but I have a number of problems:

The input data (which I have no control over the format) has a number of columns named the same thing
The ID column (and possibly others) appears as a numeric, but is actually a character or factor column - I need the leading 0s to remain, so I can't just covert back after import
I only want some columns, and with so much data would prefer not to import and then dismiss
Each file has slightly different number and name columns. The columns I want are easy to find with a regex statement and I will always end up with the same amount.

My plan of attack was to import all columns and find appropriate columns with regex, then use select on those columns within fread. But now I am stuck with assigning colClasses as these are assigned before the selection of columns and also before the names are checked so even using a named list doesn't work. Is there a way to apply the colClasses after the select/check.names without loosing my leading zeros?

I tried the naming column techniques from using colClasses in fread and also reviewed Using colClasses and select arguments of fread simultaneously but neither can deal with differences in my files

Reproducible example:

dt <- data.frame(ID = c("01","02","03"), HH = 1:3, MM = rep(0,3), HH = 2:4, MM = rep(0,3),Precipx = rnorm(3),
             other1 = rep(0,3), other2 = rep(1,3),check.names = F)
write.csv(dt, "test.csv", row.names = F, quote = F)

Colnames <- names(fread("test.csv",nrows = 0 ,check.names = T))
ColNos <- grp(c("ID|HH.1|MM.1|$Precip"),Colnames)
#This import works, but I lose leading 0s
dat <- fread("test.csv", check.names = T, select = ColNos)

#This tells me I have the wrong number of `colClasses`, but I cannot set for all columns as varies file to file
dat <- fread("test.csv", check.names = T, select = ColNos, colClasses = c("character","charcter","character","numeric"))

#This doesn't recognise that I want the second HH column. Using just `"HH"` also has this problem
# and "Precipx" will sometimes be "Precipy", "Precipz"... in the file
dat<- fread("test.csv", check.names = T, select = ColNos, 
  colClasses = c("ID" = "character","HH.1" = "charcter","MM.1" = "character","Precipx" = "numeric"))

Since your selection of columns is complex, a general strategy would be to read in the header first (e.g., set nrows = 0 in fread), and programatically choose the column numbers and colClasses you need. — thc, May 31 '18 at 05:07
@Rohit There are a number of columns that I always want, for example the second HH and MM columns. And then there's a number of identifiers such as all columns starting with 'Precip', in the real example there's more of both these kinds. Is that what you mean? — Sarah, Jun 01 '18 at 06:18
@thc that's basically what I am doing at the moment. Also setting colClasses this way isn't ideal as the names of the columns I have varies, so I can't use named columns which means I have to set a list the length of ncols and change each type I want to change. I was hoping there might be a better solution — Sarah, Jun 01 '18 at 06:21
I would regard the select parameter using column names as a convenience parameter -- if you have multiple columns named the same, how would it know which column you want? I think the approach you have now is the best solution. It also gives you a chance to select only columns you care about, which makes reading in those huge tables faster. — thc, Jun 01 '18 at 16:51
You're probably reading the right columns with your last command. The name check happens after the columns are read, so the column will appear as HH, not HH.1, but it will have the data specified by `ColNos` — Rohit, Jun 02 '18 at 05:22

fread apply select and colClasses after check.names?

0 Answers0