Import selected columns from csv in h2o

Question

I have a CSV file that is larger than 20 GB. I can read the first few lines using readlines and then figure out which columns I want to import. Is it possible to import only those columns using h2o.importFile(), or some other way in h2o, so that I am not loading unnecessary columns?

score 3 · Accepted Answer · answered May 30 '18 at 18:19

The h2o.importFile() function does not support loading only a subset of the columns. Here are some work-arounds:

1. Load the entire dataset and use the x argument in any modeling function so that only the columns you want are used as predictors:
       fit <- h2o.gbm(x = good_cols, y = y, training_frame = train)
2. Load the entire dataset and then create a new H2OFrame that contains only the columns you want:
       newdf <- df[, good_cols]
3. Create a copy of your data on disk that contains only the columns you want. This is easy to do with the cut tool:
       cut -d, -f2-4,6-10 train.csv > newtrain.csv
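
For the cut approach, the field numbers can be looked up from the header row instead of being counted by hand. The following is a minimal sketch, not part of the original answer: the file names sample.csv and sample_subset.csv and the column names age and income are made up for illustration. It also assumes a plain comma-separated file with no quoted fields that contain commas, since cut splits on every comma.

```shell
#!/bin/sh
# Tiny stand-in for the real file (hypothetical data, for illustration only).
printf 'id,age,income,notes\n1,34,50000,a\n2,41,62000,b\n' > sample.csv

# Columns to keep, by header name (hypothetical names).
wanted="age,income"

# Read only the header row and translate the wanted names
# into 1-based field positions that cut understands.
fields=$(head -n 1 sample.csv | awk -F, -v wanted="$wanted" '
  BEGIN { n = split(wanted, w, ","); for (i = 1; i <= n; i++) keep[w[i]] = 1 }
  {
    out = ""
    for (i = 1; i <= NF; i++)
      if ($i in keep) out = out (out == "" ? "" : ",") i
    print out
  }')

echo "$fields"

# Stream the file through cut; it is processed line by line,
# so the whole file never has to fit in memory.
cut -d, -f"$fields" sample.csv > sample_subset.csv
```

The smaller file can then be passed to h2o.importFile() as usual. Note that cut always emits columns in their original file order, whatever order the names are listed in.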

+1 for `cut` (I am amazed I never knew about that command!). I guess the point of the question was "20gb is too big to fit in memory", so the other solutions won't be usable. — Darren Cook, May 31 '18 at 17:41
