
I have a dataset of over 5 GB. Is there a way I can train my model with this data chunk by chunk, in a Stochastic Gradient Descent kind of way? In other words, break the set into 5 chunks of 1 GB each and then train the parameters.

I want to do this in a Python environment.

Arslán
    Did you read this? http://stackoverflow.com/questions/17710748/process-large-data-in-python – Harish Talanki Jul 08 '16 at 21:33
  • The link above has very little to do with the question. This is a machine-learning problem, not a data-processing problem. – Merlin Jul 08 '16 at 22:04
  • The question is not too broad. Machine-learning algorithms have an underlying process that does or does not lend itself to chunking the data. Some scikit-learn algorithms have "recently" been implemented to work with partial data sets; others have not. Implementations of the same type of algorithm in other languages or packages may or may not accept partial data. So, which Python package accepts partial fits is key to determining whether you can use chunking to run code in parallel. – Merlin Jul 08 '16 at 22:53

1 Answer


Yes, you can. SGDClassifier in scikit-learn has a partial_fit method; use it with your chunks:

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier

partial_fit(X, y[, classes, sample_weight]): Fit linear model with Stochastic Gradient Descent.
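For example, you could stream the file in chunks with pandas and call partial_fit on each one. This is a minimal sketch; the file name, label column, chunk size, and class labels are hypothetical placeholders:

```python
# Minimal sketch: pandas reads the CSV in chunks so the full 5 GB never sits
# in memory at once, and SGDClassifier.partial_fit updates the model on each
# chunk. "big_dataset.csv" and the "target" column are assumed names.
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
all_classes = [0, 1]  # partial_fit needs the full set of class labels,
                      # since a single chunk may not contain every class

for chunk in pd.read_csv("big_dataset.csv", chunksize=100000):
    X = chunk.drop("target", axis=1).values  # features
    y = chunk["target"].values               # labels
    clf.partial_fit(X, y, classes=all_classes)
```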
Merlin