
I am working with a huge H2OFrame (~150 GB, ~200 million rows) that I need to manipulate a little. More specifically, I have to use the frame's ip column to look up the location/city name for each IP and add this information to each of the frame's rows.

Converting the frame to a plain Python object and manipulating it locally is not an option because of its size. What I was hoping to do instead is use my H2O cluster to create a new H2OFrame, city_names, from the original frame's ip column and then merge both frames.

My question is kind of similar to the question posed here, and what I gathered from this question's answer is that there is no way in H2O to do complex manipulations of each of the frame's rows. Is that really the case? H2OFrame's apply function only accepts a lambda without custom methods after all.

One option I thought of was to use Spark/Sparkling Water for this kind of data manipulation and then convert the Spark frame to an H2OFrame for the machine learning operations. However, if possible I would prefer to avoid that and use only H2O, not least because of the overhead such a conversion creates.

So I guess it comes down to this: is there any way to do this kind of manipulation using only H2O? And if not, is there another option that does not require changing my cluster architecture (i.e. without turning my H2O cluster into a Sparkling Water cluster)?

ksbg

2 Answers


Yes. When using apply with an H2OFrame, you cannot pass a named function; only a lambda is accepted. For example, if you try passing a function called tryit, you get the following error, which shows the limitation:

H2OValueError: Argument `fun` (= <function tryit at 0x108d66410>) does not satisfy the condition fun.__name__ == "<lambda>"
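The condition quoted in that error message can be reproduced in plain Python, without a cluster. This is only a small sketch of the name check itself; `passes_name_check` and `as_lambda` are names made up for illustration and are not part of H2O.

```python
def tryit(x):
    """A named function, like the one rejected in the error above."""
    return x


# The same logic written as a lambda.
as_lambda = lambda x: x


def passes_name_check(fun):
    """Mirror the condition from the error message: fun.__name__ == "<lambda>"."""
    return fun.__name__ == "<lambda>"


print(passes_name_check(tryit))      # False: its __name__ is "tryit"
print(passes_name_check(as_lambda))  # True: its __name__ is "<lambda>"
```

So the body of the function has to be written inline as a lambda expression.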

As you already know, Sparkling Water is another option: do all the data munging in Spark first, then push your data into H2O for ML.

If you want to stick with H2O as it is, your option is to loop through the data frame and process the elements yourself. The following approach could be a little time consuming depending on your data, but it does not require you to change your environment.

  • Create a new H2OFrame by selecting only your "ip" column, and add location, city, and any other columns to it, initially filled with NA.
  • Loop through all the ip values, find the location/city for each "ip", and write the location, city, and other values into those columns.
  • Finally, cbind the new H2OFrame with the original H2OFrame.
  • Check that the "ip" and "ip0" columns match 100% to confirm the rows lined up, then remove the duplicate "ip0" column.
  • Remove the extra H2OFrame to save memory.
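The steps above can be sketched roughly as follows. This is a variation, not the exact procedure: instead of filling NA columns in place, it pulls the "ip" column to the client in batches, builds a "city" column per batch, and cbinds the result at the end. `lookup_city`, `batch_ranges`, `add_city_column`, and the batch size are hypothetical; `h2o` is imported inside the function only so that the helpers can be tried without a cluster.

```python
def batch_ranges(n_rows, batch_size):
    """Yield (start, stop) row ranges covering n_rows; stop is exclusive."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


def lookup_city(ip):
    """Hypothetical placeholder for the real ip -> city lookup."""
    return "unknown"


def add_city_column(frame, batch_size=1_000_000):
    """Return frame with a 'city' column appended (needs a running H2O cluster)."""
    import h2o  # imported here so the helpers above work without H2O installed

    city_frame = None
    for start, stop in batch_ranges(frame.nrow, batch_size):
        # Download only the ip values of this batch as a list of rows.
        rows = frame[start:stop, "ip"].as_data_frame(use_pandas=False, header=False)
        cities = [lookup_city(row[0]) for row in rows]
        # Upload the answers for this batch as a one-column frame.
        part = h2o.H2OFrame({"city": cities}, column_types=["string"])
        city_frame = part if city_frame is None else city_frame.rbind(part)
    # Row order is preserved, so the columns line up for cbind.
    return frame.cbind(city_frame)
```

With 200 million rows this still moves every ip value through the client once, so it is worth timing on a small slice first.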
AvkashChauhan

If your ip --> city algorithm is a lookup table, you could create that as a data frame, then use h2o.merge. For an example, this video (starting at around the 59min mark) shows how to merge weather data into the airlines data.
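A minimal sketch of the lookup-table merge, assuming the table fits in client memory as a Python dict and that both frames share a column named "ip". `build_lookup_columns` and `merge_cities` are hypothetical helpers; `H2OFrame.merge` with `all_x=True` keeps every row of the big frame, like a left join.

```python
def build_lookup_columns(ip_to_city):
    """Turn an {ip: city} dict into column-oriented data for h2o.H2OFrame."""
    return {
        "ip": list(ip_to_city.keys()),
        "city": list(ip_to_city.values()),
    }


def merge_cities(frame, ip_to_city):
    """Left-join the lookup table onto frame (needs a running H2O cluster)."""
    import h2o  # imported here so the helper above works without H2O installed

    lookup = h2o.H2OFrame(build_lookup_columns(ip_to_city))
    # Assumption: the key column may need to be categorical on both sides,
    # depending on the H2O version; adjust or drop these casts as needed.
    lookup["ip"] = lookup["ip"].asfactor()
    frame["ip"] = frame["ip"].asfactor()
    # The merge uses the column names the two frames have in common ("ip").
    return frame.merge(lookup, all_x=True)
```

The join itself runs on the cluster, so only the lookup table travels from the client.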

For ip addresses I imagine you might want to first truncate to the first two or three parts.
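In plain Python, that truncation could look like the sketch below; `ip_prefix` is a hypothetical helper, and dotted IPv4 strings are assumed. Keying the lookup table on the prefix rather than the full address keeps the table much smaller.

```python
def ip_prefix(ip, parts=3):
    """Keep only the first `parts` dot-separated parts of an IPv4 string."""
    return ".".join(ip.split(".")[:parts])


print(ip_prefix("192.168.1.77"))           # 192.168.1
print(ip_prefix("192.168.1.77", parts=2))  # 192.168
```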

If you don't have a lookup table, it becomes an interesting question which is quicker: turning a complex algorithm into a lookup table and doing the h2o.merge, or downloading your huge data in batches, running the algorithm locally on the client, uploading each batch of answers, and doing an h2o.cbind at the end.

BTW, the cool and trendy approach would be to sample 1 million of your ip addresses, look up the correct answer on the client to make a training data set, then use H2O to build a machine learning model. You can then use h2o.predict() to create the new city column in your real data. (You will want to at least split the ip address into 4 columns first, though.) (My hunch is that a deep random forest would work best, but I would definitely experiment a bit.)
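A rough sketch of that idea, under several assumptions: `labelled` is a client-side list of (ip, city) pairs, the "ip" column of the big frame is a string column, and the hyperparameters are arbitrary starting points. `ip_to_octets`, `build_training_columns`, and `train_and_predict` are hypothetical helpers, and the prediction is called on the trained model.

```python
OCTETS = ["o1", "o2", "o3", "o4"]


def ip_to_octets(ip):
    """Split a dotted IPv4 string into its four integer parts."""
    return [int(part) for part in ip.split(".")]


def build_training_columns(labelled):
    """Turn [(ip, city), ...] into column-oriented data for h2o.H2OFrame."""
    cols = {name: [] for name in OCTETS}
    cols["city"] = []
    for ip, city in labelled:
        for name, value in zip(OCTETS, ip_to_octets(ip)):
            cols[name].append(value)
        cols["city"].append(city)
    return cols


def train_and_predict(labelled, full_frame):
    """Train on the labelled sample, predict city for every row (needs a cluster)."""
    import h2o  # imported here so the helpers above work without H2O installed
    from h2o.estimators import H2ORandomForestEstimator

    train = h2o.H2OFrame(build_training_columns(labelled))
    train["city"] = train["city"].asfactor()  # classification target

    # Arbitrary values for a "deep" forest; tune these by experiment.
    model = H2ORandomForestEstimator(ntrees=50, max_depth=30)
    model.train(x=OCTETS, y="city", training_frame=train)

    # Split the ip strings on the cluster; assumes "ip" is a string column
    # (an enum column would first need ascharacter()).
    octets = full_frame["ip"].strsplit("\\.").asnumeric()
    octets.names = OCTETS
    return full_frame.cbind(model.predict(octets)["predict"])
```

Unlike a lookup, the predicted city is only an estimate, so a held-out part of the labelled sample is worth keeping to measure how often it is wrong.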

Darren Cook