
The raw data shape is (200000, 15), but after pre-processing and applying one-hot encoding the dimension has increased to (200000, 300).
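For a sense of scale, here is a rough back-of-the-envelope estimate (my assumption, not stated in the question) of what the encoded matrix costs if it is held as a dense float64 array, which is one common way the encoding ends up being materialized:

```python
import numpy as np

# Back-of-the-envelope estimate; assumes the encoded matrix is a dense
# float64 array (the question does not say how it is materialized).
rows, cols = 200_000, 300
dense_mb = rows * cols * np.dtype(np.float64).itemsize / 1024**2
print(f"dense float64: {dense_mb:.0f} MB")  # ~458 MB per copy

# Each extra copy made during train/test splitting or model fitting costs
# the same again, which is how a 16 GB machine can still run out of memory.
```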

The data needs to be trained with Linear Regression, XGBoost, and Random Forest for predictive modeling. LabelEncoder was used earlier, but the results were not satisfactory.

The (200000, 300) matrix consumes a large amount of RAM and raises a MemoryError during training; a sketch of a sparse workaround follows the list below.

  • Running in a Jupyter Notebook on AWS with 16 GB RAM
  • Using sklearn for most of the ML part
  • Data is in CSV format (loaded as a pandas DataFrame)
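
One way to avoid the MemoryError, offered here as a sketch rather than a definitive fix, is to keep the one-hot encoding sparse end to end: sklearn's OneHotEncoder returns a SciPy sparse matrix by default, and LinearRegression, RandomForestRegressor and xgboost's XGBRegressor all accept sparse input, so the 300-column matrix never needs to be densified. The file name, column selection and "target" label below are placeholders for the real dataset:

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

# Placeholder file and column names; adjust to the real dataset.
df = pd.read_csv("data.csv")
y = df.pop("target").to_numpy()
cat_cols = df.select_dtypes(include="object").columns
num_cols = df.columns.difference(cat_cols)

# OneHotEncoder returns a SciPy sparse matrix by default, so the encoded
# block is never materialized as a dense (200000, 300) array.
enc = OneHotEncoder(handle_unknown="ignore")
X_cat = enc.fit_transform(df[cat_cols])
X_num = sparse.csr_matrix(df[num_cols].to_numpy(dtype="float32"))
X = sparse.hstack([X_num, X_cat], format="csr")

# LinearRegression, RandomForestRegressor and xgboost.XGBRegressor all
# accept SciPy sparse input, so X can be passed to them directly.
model = LinearRegression().fit(X, y)
```

If a dense matrix really is required somewhere, casting it to float32 halves the footprint compared with float64, and a uint8 indicator matrix takes one eighth.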

Would appreciate any suggestions!

  • Try this: https://stackoverflow.com/a/4285292/2075165 is a simple Python memory-capacity test. Maybe it will give you an idea of what counts as too much memory. I understand that "15" is the number of columns. What is stored in these columns, and how much data is used by one row? Multiplying that by 200,000 would give you an idea of how much memory is being consumed. Also, how big is this `csv` file? Maybe it is too large and you should process it in chunks. – Tomasz Giba Dec 07 '17 at 14:02
  • Thanks for the link @TomaszGiba. The CSV is about 400 MB, which is not that big, but as I mentioned the number of columns has increased because of the OneHotEncoding, which replaces each categorical value with 1/0 indicator columns. I will go through the link, thanks a lot. – Soumyaansh Dec 07 '17 at 14:23
  • Yeah, that is not much. Even if the columns are expanded from 15 to 300, that is still one bit per column, so 200,000 * 300 bits is about 0.01 GB; even with the 400 MB file on top, you should still have tons of RAM left. Something is fishy. – Tomasz Giba Dec 07 '17 at 14:26
  • Maybe convert to a sparse matrix before processing – Vivek Kumar Dec 11 '17 at 09:07
  • Where is the actual memory error happening? What stage of your training? – rayryeng Dec 15 '17 at 10:13
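
Following the suggestions in the comments above about measuring per-row cost and reading the file in chunks, here is a minimal sketch; the file name, column names and dtypes are placeholders, not taken from the question:

```python
import pandas as pd

# Measure what the frame actually costs in memory while reading the CSV in
# chunks. Column names and dtypes below are placeholders for the real schema;
# "category" and "float32" dtypes keep each chunk small.
col_dtypes = {"num_col_1": "float32", "cat_col_1": "category"}

total_bytes = 0
total_rows = 0
for chunk in pd.read_csv("data.csv", dtype=col_dtypes, chunksize=50_000):
    # memory_usage(deep=True) includes string/object storage, so this is a
    # realistic estimate rather than just the column pointers.
    total_bytes += chunk.memory_usage(deep=True).sum()
    total_rows += len(chunk)

print(f"~{total_bytes / 1024**2:.0f} MB total, "
      f"~{total_bytes / max(total_rows, 1):.0f} bytes per row")
```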

0 Answers