I have a massive csv file but I only use a small subset of its columns in my analysis. To save time and memory space, I would like to load just the necessary columns. I tried using the colClasses
method of read.csv
as suggested here but I couldn't make it work.
Let me describe the issue with a MWE. Suppose my data (the csv file) is created by the following:
df <- data.frame(a = c('3', '4'), b = c(5, 6))
write.csv(x = df, file = 'df.csv', row.names = F)
In the csv, column a is saved as text, while b is saved as numeric. I would like to load only column a for my analysis. My idea is to just get the column types to form a colClasses vector. To do this, I load just the first row of the data (which is fast, and in practice I have 1M+ rows) retrieve the column types and create a vector to be passed to colClasses
:
df <- read.csv(file = 'df.csv', nrows = 1) # read just first row
cols <- colnames(df) # column names
coltypes <- sapply(df, class) # column types
wanted_cols <- c('a') # column names needed for analysis
cc <- rep('NULL', length(cols)) # initialize colClasses vector
cc[cols %in% wanted_cols] <- coltypes[cols %in% wanted_cols] # put the needed types into cc
data <- read.csv(file = 'df.csv', colClasses = cc) # load all rows but just needed columns
However, when R loads the data through read.csv
(first line) it sees only integers in column a and automatically converts it to an integer type. When I feed back this type into the colClasses
argument, it cannot load the data because a is stored as a string in the csv. I get:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, :
scan() expected 'a real', got '"3"'
Another problem that also arises is that by loading just the first row of data to get the column types, I may not give R enough information. If some column's first element is 1, it looks like R thinks it's a logical type, while it could in fact be a lot of other types.
Is there a way to make this work? Or is there a different technique that would enable me to load certain columns based on their names?