I have an H2O frame object in R like this:

h2odf

A | B | C | D
--|---|---|---
1 | NA| 2 | 0
2 | 1 | 2 | 0
3 | NA| 2 | 0
4 | 3 | 2 | 0

I want to remove all rows where B is NA (the 1st and 3rd rows). I have tried

na <- is.na(h2odf[,"B"])
h2odf <- h2odf[!na,]

and

h2odf <- h2odf[!is.na(h2odf$B),]

and

h2odf <- subset(h2odf, B!=NA)

This kind of filtering works on a regular R data frame, but on the H2O frame I get this error:

Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page,  : 


ERROR MESSAGE:

DistributedException from localhost/127.0.0.1:54321: 'Cannot set illegal UUID value'
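For reference, here is a minimal sketch of how the three attempts above behave on an ordinary `data.frame` rather than an H2O frame. The name `df` and the `kept_*` variables are illustrative only. Note that even in plain R the `B != NA` form keeps no rows, because any comparison with `NA` evaluates to `NA`; `!is.na(B)` is the usual test. This says nothing about what H2O does with the same calls.

```r
# Plain data.frame mirroring the example table (not an H2OFrame).
df <- data.frame(A = 1:4, B = c(NA, 1, NA, 3), C = 2, D = 0)

# Attempt 1: logical mask built with is.na(); column names are case-sensitive.
na <- is.na(df[, "B"])
kept_mask <- df[!na, ]

# Attempt 2: the same filter written inline.
kept_inline <- df[!is.na(df$B), ]

# Attempt 3: `B != NA` is NA for every row, so subset() keeps nothing.
kept_neq <- subset(df, B != NA)

# subset() with an explicit missingness test keeps rows 2 and 4.
kept_subset <- subset(df, !is.na(B))

print(kept_inline)
print(nrow(kept_neq))
```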

Desired output is

h2odf

A | B | C | D
--|---|---|---
2 | 1 | 2 | 0
4 | 3 | 2 | 0

One option is to convert it to an R data frame, remove the rows, and convert it back to an H2O frame. But that takes a long time because the input file is close to 4.5 GB. Is it possible to do this on the H2O frame (hex object) itself?

I am running RStudio on an AWS cluster.

penguin
  • have you tried the subset function? idk what H2O frames are, but it's very simple syntactically – 3pitt Sep 08 '17 at 13:22
  • Yes. I have tried this h2odf <- subset(h2odf, B!=NA). Not working. H2O is a platform that makes it faster to apply machine learning algorithms on big data. Doing this using normal R dataframes is very slow. I am using its R library. https://cran.r-project.org/web/packages/h2o/h2o.pdf . https://www.h2o.ai/h2o/ – penguin Sep 08 '17 at 13:36
  • oh yeah you need to use subset(h2odf, !is.na(B)) or a column of B perhaps – 3pitt Sep 08 '17 at 13:37
  • Thanks but I have already tried this. Not working. – penguin Sep 08 '17 at 13:41
  • if neither the subset function nor the x[bool,] approach works, then it's probably something specific to this h2odf data type. have you looked at https://stackoverflow.com/questions/27181616/subsetting-in-h2o-r#27296668 – 3pitt Sep 08 '17 at 13:42
  • I am running this on RStudio on aws cluster. Getting this error: Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, : ERROR MESSAGE: DistributedException from localhost/127.0.0.1:54321: 'Cannot set illegal UUID value' – penguin Sep 08 '17 at 13:47
  • I have tried everything mentioned here https://stackoverflow.com/questions/8005154/conditionally-remove-dataframe-rows-with-r – penguin Sep 08 '17 at 13:48
  • can you just convert the object to a data frame? – 3pitt Sep 08 '17 at 13:58
  • Yes. That is possible. But it will take too much time. My input file is more than 4GB. Reading into R dataframe is very slow compared to h2o frame. – penguin Sep 08 '17 at 14:01

1 Answer

> class(h2odf)
[1] "H2OFrame"

> h2odf
  A  B C D
1 1 NA 2 0
2 2  1 2 0
3 3 NA 2 0
4 4  3 2 0

[4 rows x 4 columns] 

> h2odf[!is.na(as.numeric(as.character(h2odf$B))),]
  A B C D
1 2 1 2 0
2 4 3 2 0

[2 rows x 4 columns]
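The `as.numeric(as.character(...))` step matters only if the column holds the literal text "NA" instead of a true missing value. A minimal plain-R sketch of that situation follows; the vector `b_text` is hypothetical, and whether H2O actually parses the column this way is an assumption, not something confirmed here.

```r
# Hypothetical column that was parsed as text, so "NA" is a literal string.
b_text <- c("NA", "1", "NA", "3")

# is.na() does not flag the string "NA" as missing.
flag_text <- is.na(b_text)

# Coercing to numeric yields real NA values for the "NA" strings.
b_num <- as.numeric(as.character(b_text))
flag_num <- is.na(b_num)

# Filter a plain data.frame with the mask built from the coerced column.
df <- data.frame(A = 1:4, B = b_text, C = 2, D = 0, stringsAsFactors = FALSE)
kept <- df[!flag_num, ]
print(kept)
```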
Sagar
  • Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, : – penguin Sep 08 '17 at 13:45
  • @penguin - Never worked on H2O. My answer might not be valid for your query in that case. I can take it off so others can still assist you. – Sagar Sep 08 '17 at 13:59
  • Thanks. I am not sure whether you should remove it. This works for default R dataframe object. I was hoping that it will work for r h2o frame object also as most of R operations work. – penguin Sep 08 '17 at 14:04
  • This is a similar issue but still unresolved https://stackoverflow.com/questions/27181616/subsetting-in-h2o-r#27296668 – penguin Sep 08 '17 at 14:04
  • Can you show how to create that `H2O frame`? I have downloaded the required package. – Sagar Sep 08 '17 at 14:07
  • To convert an existing dataframe to H2O frame use df <- as.h2o(h2odf). To read directly from csv, use df <- h2o.importFile(path = filepath) – penguin Sep 08 '17 at 14:12
  • After converting to an `H2OFrame` and running the same commands, I still get the output you are looking for. Please see updated code above. – Sagar Sep 08 '17 at 14:22
  • Ok. I think I was able to reproduce your issue. Because you have `NA` explicitly in your `H2OFrame`, it is being treated as `character`. Please see the updated code above. Hope it helps. – Sagar Sep 08 '17 at 14:27
  • I'm getting this error: Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, : ERROR MESSAGE: Object 'RTMP_sid_b571_28' not found for argument: key – penguin Sep 08 '17 at 14:33
  • I think I will need the exact data set to reproduce this particular issue. – Sagar Sep 08 '17 at 14:35
  • Thanks but actual data is confidential. I am running rstudio on aws cluster. Column B is numeric. – penguin Sep 08 '17 at 14:39
  • I won't be able to reproduce the issue in that case. Gave a try though. – Sagar Sep 08 '17 at 14:49
  • Thanks for that :) – penguin Sep 08 '17 at 15:54