
I have a CSV file that is larger than 20 GB. I can read the first few lines using readLines and then figure out which columns I want to import. Is it possible to import only those columns using h2o.importFile(), or some other way in h2o, so that I am not loading unnecessary columns?

deepAgrawal

1 Answer


The h2o.importFile() function does not support loading only a subset of the columns. Here are some work-arounds:

  • Load in the entire dataset and use the x argument in any modeling function to ignore certain columns: `fit <- h2o.gbm(x = good_cols, y = y, training_frame = train)`
  • Load in the entire dataset and then create a new H2OFrame which contains only the columns you want: `newdf <- df[, good_cols]`
  • Create a copy of your data on disk that contains only the columns you want. This is easy to do using the cut tool, for example: `cut -d, -f2-4,6-10 train.csv > newtrain.csv`
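
Since the question mentions picking columns after looking at the first few lines, the cut work-around can be driven by column names instead of hand-counted positions. The following is only a sketch: the tiny train.csv, the column names in `keep`, and the awk lookup step are illustrative assumptions, not part of the original answer. It also assumes a simple CSV with no quoted fields containing commas, because cut splits on every comma.

```shell
# Hypothetical demo file standing in for the real 20 GB train.csv.
printf 'id,age,income,notes,label\n1,34,52000,a,0\n2,45,61000,b,1\n' > train.csv

# Names of the columns to keep (hypothetical, for illustration only).
keep="age,income,label"

# Read only the header line, put one column name per line, and let awk
# print the line numbers (= column positions) of the wanted names,
# joined with commas so the result can be passed to cut -f.
fields=$(head -n 1 train.csv | tr ',' '\n' | awk -v keep="$keep" '
  BEGIN { n = split(keep, k, ","); for (i = 1; i <= n; i++) want[k[i]] = 1 }
  ($0 in want) { printf "%s%s", sep, NR; sep = "," }')

# Write the reduced copy; cut streams the file, so it does not need
# to hold the whole file in memory.
cut -d, -f"$fields" train.csv > newtrain.csv
```

The reduced newtrain.csv can then be passed to h2o.importFile() in place of the original file. Note that cut always emits columns in their original file order, regardless of the order listed in `keep`.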
Erin LeDell
    +1 for `cut` (I am amazed I never knew about that command!). I guess the point of the question was "20gb is too big to fit in memory", so the other solutions won't be usable. – Darren Cook May 31 '18 at 17:41