I have a Pandas DataFrame read from a CSV file that is latin-1 encoded and delimited by `;`. The DataFrame is very large: roughly 350000 rows x 3800 columns.
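For reference, this is roughly how the file is read; the file name and column names below are hypothetical stand-ins for the real 350000 x 3800 data:

```python
import pandas as pd

# Write a tiny latin-1, semicolon-delimited sample file
# (a stand-in for the real sensor CSV).
raw = "sensor_a;sensor_b;städt\n1.0;;3.5\n;2.0;4.1\n"
with open("sensors.csv", "w", encoding="latin-1") as f:
    f.write(raw)

# Empty fields between the ';' delimiters are parsed as NaN.
df = pd.read_csv("sensors.csv", sep=";", encoding="latin-1")
print(df.shape)                 # (2, 3)
print(int(df.isna().sum().sum()))  # 2 missing cells
```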
I initially wanted to use sklearn, but my DataFrame has missing values (NaN), so I could not use sklearn's random forests or GBM. I therefore used H2O's Distributed Random Forest to train on the dataset. The main problem is that the DataFrame is not efficiently converted when I do `h2o.H2OFrame(data)`.
I checked whether there is a way to provide encoding options, but there is nothing in the documentation.
Does anyone have an idea about this? Any leads would help. I would also like to know whether there are other libraries like H2O that handle NaN values efficiently. I know the columns can be imputed, but I should not do that with my dataset: the columns are values from different sensors, and a missing value means the sensor is not present. I can only use Python.