I'm trying to read in a large amount of data (up to 100 files with sizes up to 1.5GB each) which are in a slightly annoying format, and each vary slightly. For speed reasons I want to use data.table::fread
but I have a number of problems:
- The input data (which I have no control over the format) has a number of columns named the same thing
- The ID column (and possibly others) appears as a numeric, but is actually a character or factor column - I need the leading 0s to remain, so I can't just covert back after import
- I only want some columns, and with so much data would prefer not to import and then dismiss
- Each file has slightly different number and name columns. The columns I want are easy to find with a regex statement and I will always end up with the same amount.
My plan of attack was to import all columns and find appropriate columns with regex, then use select
on those columns within fread
. But now I am stuck with assigning colClasses
as these are assigned before the selection of columns and also before the names are checked so even using a named list doesn't work. Is there a way to apply the colClasses
after the select
/check.names
without loosing my leading zeros?
I tried the naming column techniques from using colClasses in fread and also reviewed Using colClasses and select arguments of fread simultaneously but neither can deal with differences in my files
Reproducible example:
dt <- data.frame(ID = c("01","02","03"), HH = 1:3, MM = rep(0,3), HH = 2:4, MM = rep(0,3),Precipx = rnorm(3),
other1 = rep(0,3), other2 = rep(1,3),check.names = F)
write.csv(dt, "test.csv", row.names = F, quote = F)
Colnames <- names(fread("test.csv",nrows = 0 ,check.names = T))
ColNos <- grp(c("ID|HH.1|MM.1|$Precip"),Colnames)
#This import works, but I lose leading 0s
dat <- fread("test.csv", check.names = T, select = ColNos)
#This tells me I have the wrong number of `colClasses`, but I cannot set for all columns as varies file to file
dat <- fread("test.csv", check.names = T, select = ColNos, colClasses = c("character","charcter","character","numeric"))
#This doesn't recognise that I want the second HH column. Using just `"HH"` also has this problem
# and "Precipx" will sometimes be "Precipy", "Precipz"... in the file
dat<- fread("test.csv", check.names = T, select = ColNos,
colClasses = c("ID" = "character","HH.1" = "charcter","MM.1" = "character","Precipx" = "numeric"))