0

I have a massive csv file but I only use a small subset of its columns in my analysis. To save time and memory space, I would like to load just the necessary columns. I tried using the colClasses method of read.csv as suggested here but I couldn't make it work.

Let me describe the issue with a MWE. Suppose my data (the csv file) is created by the following:

df <- data.frame(a = c('3', '4'), b = c(5, 6))
write.csv(x = df, file = 'df.csv', row.names = F)

In the csv, column a is saved as text, while b is saved as numeric. I would like to load only column a for my analysis. My idea is to just get the column types to form a colClasses vector. To do this, I load just the first row of the data (which is fast, and in practice I have 1M+ rows) retrieve the column types and create a vector to be passed to colClasses:

df <- read.csv(file = 'df.csv', nrows = 1)  # read just first row
cols <- colnames(df)  # column names
coltypes <- sapply(df, class)  # column types
wanted_cols <- c('a')  # column names needed for analysis
cc <- rep('NULL', length(cols))  # initialize colClasses vector
cc[cols %in% wanted_cols] <- coltypes[cols %in% wanted_cols]  # put the needed types into cc
data <- read.csv(file = 'df.csv', colClasses = cc)  # load all rows but just needed columns

However, when R loads the data through read.csv (first line) it sees only integers in column a and automatically converts it to an integer type. When I feed back this type into the colClasses argument, it cannot load the data because a is stored as a string in the csv. I get:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = 
dec,  : 
scan() expected 'a real', got '"3"'

Another problem that also arises is that by loading just the first row of data to get the column types, I may not give R enough information. If some column's first element is 1, it looks like R thinks it's a logical type, while it could in fact be a lot of other types.

Is there a way to make this work? Or is there a different technique that would enable me to load certain columns based on their names?

Ben
  • 429
  • 4
  • 11

1 Answers1

0

Found another solution: use fread(file, select = column_names) from data.table. You can specify column indices or names to the select argument to load only these columns.

Ben
  • 429
  • 4
  • 11