
I am working on the bulldozer price prediction problem using RandomForestRegressor. After removing all the missing values and converting all the data to numeric, I try to fit the data to a model. The dataset is fairly large, about 412698 rows × 57 columns, and I am on a machine with 3 GB of RAM.

Here is my code:

%%time
from sklearn.ensemble import RandomForestRegressor

# Instantiate model
model = RandomForestRegressor(n_jobs=-1,
                              random_state=42)

# Fit the model
model.fit(df_tmp.drop("SalePrice", axis=1), df_tmp["SalePrice"])

The dataset is available on Kaggle; here is the link: https://www.kaggle.com/c/bluebook-for-bulldozers/data

Rohith V
  • My recommendations: 1. Try converting the pandas DataFrame to a NumPy array. 2. If it still does not fit into memory, reduce the input size. 3. A random forest grows trees with lots and lots of branches, so by default it occupies a lot of memory; try using only the meaningful features. (A sketch of the first two suggestions follows below.) – Subbu VidyaSekar Jul 07 '20 at 05:44
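A minimal sketch of the commenter's first two suggestions, assuming df_tmp is already fully numeric as the question states; the float32 choice is an assumption to save memory, though scikit-learn's tree ensembles convert input to float32 internally anyway, so it also avoids an extra copy:

import numpy as np
import pandas as pd

# Downcast 64-bit columns to smaller dtypes to shrink the DataFrame
# (assumes df_tmp is already fully numeric, as stated in the question)
for col in df_tmp.columns:
    if df_tmp[col].dtype == "float64":
        df_tmp[col] = pd.to_numeric(df_tmp[col], downcast="float")
    elif df_tmp[col].dtype == "int64":
        df_tmp[col] = pd.to_numeric(df_tmp[col], downcast="integer")

# Pass plain float32 NumPy arrays to fit() instead of the DataFrame
X = df_tmp.drop("SalePrice", axis=1).to_numpy(dtype=np.float32)
y = df_tmp["SalePrice"].to_numpy(dtype=np.float32)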

1 Answer


You can use batch processing when you have more data than your RAM can handle. Scikit-learn supports this out of the box: you have to use the warm_start parameter of RandomForestRegressor.

warm_start : bool, default=False
When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.

You can try something like this:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# warm_start=True lets each call to fit() add trees to the existing forest
model = RandomForestRegressor(warm_start=True, n_estimators=0,
                              n_jobs=-1, random_state=42)

for df_split in np.array_split(df_tmp, 500):  # split into 500 smaller dataframes
    model.n_estimators += 2  # fit() only adds trees if n_estimators has grown
    model.fit(df_split.drop("SalePrice", axis=1), df_split["SalePrice"])
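Note that with warm_start=True, fit() only trains new trees when n_estimators has been increased since the previous call; if it stays the same, scikit-learn warns and fits nothing new, which is why the loop bumps it on every batch. Also be aware that each tree then only sees the rows of its own batch rather than the full dataset, so the result will differ from a forest trained on all the data at once.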

If you have memory errors while loading the pandas DataFrame itself, check this.
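In that situation, a common pattern is to read the CSV in chunks and shrink each chunk before concatenating. A minimal sketch, where the file name TrainAndValid.csv is taken from the competition data and preprocess() is a hypothetical stand-in for your own cleaning steps:

import pandas as pd

# Read the CSV in chunks so only one piece is in memory at a time;
# preprocess() is a hypothetical helper (e.g. fill missing values,
# encode categoricals, downcast dtypes) so each stored chunk is smaller
chunks = []
for chunk in pd.read_csv("TrainAndValid.csv", low_memory=False, chunksize=50_000):
    chunks.append(preprocess(chunk))

df_tmp = pd.concat(chunks, ignore_index=True)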

You can always use Google Colab for more RAM and a GPU. It is free and very easy to get started with.

Ajay Chinni