21

I have a Pandas dataframe with latin-1 encoding, delimited by `;`. The dataframe is very large, almost 350000 x 3800 in size. I wanted to use sklearn initially, but my dataframe has missing values (NaN values), so I could not use sklearn's random forests or GBM. So I had to use H2O's Distributed Random Forests for training on the dataset. The main problem is that the dataframe is not efficiently converted when I do `h2o.H2OFrame(data)`. I checked for the possibility of providing encoding options, but there is nothing in the documentation.

Does anyone have an idea about this? Any leads could help me. I also want to know if there are any other libraries like H2O which can handle NaN values efficiently. I know that we can impute the columns, but I should not do that in my dataset because my columns are values from different sensors; if a value is missing, it means that the sensor is not present. I can only use Python.

ayaan
  • Xgboost can deal with missing values perfectly. – CrazyElf Oct 27 '17 at 09:52
  • @CrazyElf, yeah I read about xgboost, but the problem is I cannot build it from source because I do not have admin permissions to install git or mingw on my laptop at work. I can use pip, but unfortunately pip support for xgboost has been removed – ayaan Oct 27 '17 at 09:57
  • You can try to install xgboost from here: http://www.lfd.uci.edu/~gohlke/pythonlibs/#xgboost for me it works perfectly. – CrazyElf Oct 27 '17 at 10:12
  • @CrazyElf Thank you for the link, I will try it immediately – ayaan Oct 27 '17 at 11:05
  • @CrazyElf: the wheel files in the link are for Windows; I am using Linux, I forgot to mention that – ayaan Oct 27 '17 at 11:09
  • Can't help you with linux, I'm not advanced in it, sorry :) – CrazyElf Oct 27 '17 at 11:13
  • What do you mean by "The main Problem is the dataframe is not efficiently converted when i do `h2o.H2OFrame(data)`"? I am not sure what the problem is, is there an error? – Erin LeDell Oct 29 '17 at 18:44
  • @ErinLeDell, there isn't any explicit error thrown, but the encoding is not correct and the conversion also takes a lot of time to change the dataframe to an H2O frame. I suspect this behavior is occurring because of the size of the dataset. I wish H2O could provide the possibility to use pandas dataframes, as they are memory efficient – ayaan Oct 29 '17 at 21:27
  • @ayaan Pandas dataframes are not more memory efficient -- they can only be used on a single machine which makes them much more limited, memory-wise than H2OFrames. The `as.H2OFrame()` function has to write to disk to get the data from Python memory into Java memory and disk read/write is what takes a long time (does not have to do with memory). I'd recommend reading the data directly from disk into H2O using `h2o.import_file()` and skipping Pandas dataframes altogether. – Erin LeDell Oct 30 '17 at 20:38
  • @ErinLeDell Thank you, I will try to read it directly from disk instead of from a pandas dataframe – ayaan Oct 30 '17 at 21:01
  • Typo in my comment above: `as.H2OFrame()` should be `h2o.H2OFrame()`. – Erin LeDell Oct 31 '17 at 01:11

2 Answers

36
import h2o
import pandas as pd

h2o.init()  # start or connect to a local H2O cluster

# A small dataframe with non-ASCII (latin-1) characters converts without issue
df = pd.DataFrame({'col1': [1, 1, 2],
                   'col2': ['César Chávez Day'] * 3})
hf = h2o.H2OFrame(df)

Since the problem that you are facing is due to the high number of NaNs in the dataset, this should be handled first. There are two ways to do so.

  1. Replace NaN with a single, obviously out-of-range value. E.g., if a feature varies between 0-1, replace all NaNs with -1 for that feature.

  2. Use the class Imputer to handle NaN values. This will replace NaN with the mean, median or mode of that feature.
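Both approaches can be sketched in plain pandas (a minimal illustration; the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical sensor readings with missing values
df = pd.DataFrame({'sensor_a': [0.2, None, 0.9],
                   'sensor_b': [None, 3.0, 5.0]})

# 1. Sentinel value: replace NaN with an out-of-range value such as -1
sentinel = df.fillna(-1)

# 2. Simple imputation: replace NaN with the mean of each column
imputed = df.fillna(df.mean())
```

After either step the dataframe has no NaNs left, so sklearn's estimators will accept it.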

Anand C U
  • @Anand C U I was doing exactly the same thing, but I am thinking that because of the large size of the dataframe the type conversion is not efficient – ayaan Oct 27 '17 at 09:54
  • @ayaan Do you have too many NANs in your dataset? What percentage of the values are NANs in the dataset? – Anand C U Oct 27 '17 at 09:57
  • Yeah, around 40% of my dataframe is NaN – ayaan Oct 27 '17 at 09:57
  • @ayaan One option would be to replace `nan` with a single, obviously out-of-range value. E.g., if a feature varies between `0-1`, replace all `nan` with `-1` for that feature. Then you can possibly use sklearn's algorithms as well – Anand C U Oct 27 '17 at 10:00
  • @ayaan Or you could also use the class [`Imputer`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html) to handle `NAN` values – Anand C U Oct 27 '17 at 10:04
  • I cannot use Imputer in my case because NaN values mean the sensor is not applicable and I need to keep them as they are; I cannot replace them with the mean or median. Your previous suggestion of using something like an outlier value works, but I would also like to try with the NaN values – ayaan Oct 27 '17 at 11:07
  • I know this is not a perfect solution but, If it's only to check the accuracy and the algorithm, you can always do it in the above way. – Anand C U Oct 27 '17 at 11:33
  • Yes, definitely. I already finished working on the outlier method; I just wanted to try once with the NaN values – ayaan Oct 27 '17 at 12:13
7

If there are a large number of missing values in your data and you want to increase the efficiency of conversion, I would recommend explicitly specifying the column types and NA strings instead of letting H2O interpret them. You can pass a list of strings to be interpreted as NAs and a dictionary specifying column types to the H2OFrame() method.

It will also allow you to create custom labels for the sensors that are not present, instead of a generic "not available" (impute the NaN values with a custom string in pandas before conversion).
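For instance, such a custom label can be filled in with pandas before the conversion (a sketch; the column name and label are hypothetical):

```python
import pandas as pd

# Hypothetical sensor column with a missing reading
df = pd.DataFrame({'sensor_a': [0.2, None, 0.9]})

# Label missing readings explicitly; the column becomes object dtype,
# and the label can then be listed in H2O's na_strings (or kept as a category)
df['sensor_a'] = df['sensor_a'].fillna('sensor_absent')
```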

import h2o

# Map each column name to its H2O type; col1_type etc. are placeholders
col_dtypes = {'col1_name': col1_type, 'col2_name': col2_type}
# Strings that should be interpreted as missing values
na_list = ['NA', 'none', 'nan', 'etc']

hf = h2o.H2OFrame(df, column_types=col_dtypes, na_strings=na_list)

For more information, see http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/_modules/h2o/frame.html#H2OFrame

Edit: @ErinLeDell's suggestion to use h2o.import_file() directly, with the column dtypes and NA strings specified, will give you the largest speed-up.

karhayrap