How much data do I need to properly train my ML model?

Question

I collected 8800 samples total, but after data cleaning and outlier detection I was left with 3507 samples.

Is this enough to put through machine learning models? (lasso, linear regression, decision tree)
Should I scrape more?

I expect more data is needed, but I want to check with others before wasting time.

Also, how much data should I use for training and testing?

How many attributes does each sample have? (# of independent variables, i.e. features) — davetherock, Jun 12 '23 at 23:42

score 1 · Accepted Answer · answered Jun 12 '23 at 23:41

When it comes to machine learning, more data almost always helps, provided it is clean and representative of what you want to predict.

In general, as your model gets more complex, you'll need more data to prevent overfitting.

For example, a single-variable linear regression requires less data to train than a convolutional neural network. This is because the neural network has more weights than the single-variable model.
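One rough way to judge whether you already have enough data for a given model is a learning curve: train on growing subsets and watch the error on held-out data. The sketch below does this for a single-variable linear regression using only the standard library. The data is synthetic (y = 3x + 2 plus noise), and the helpers `fit_line`, `mse`, and `make_data` are made up for illustration, so treat it as a sketch under those assumptions rather than a recipe for your dataset.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def mse(a, b, xs, ys):
    """Mean squared error of the line y = a*x + b on (xs, ys)."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def make_data(n, rng):
    # ASSUMPTION: synthetic data, y = 3x + 2 plus Gaussian noise (sd = 1).
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [3 * x + 2 + rng.gauss(0, 1) for x in xs]
    return xs, ys

rng = random.Random(0)  # fixed seed so the run is repeatable
train_x, train_y = make_data(3000, rng)
test_x, test_y = make_data(500, rng)

# Fit on progressively larger training subsets, always scoring on the same test set.
learning_curve = []
for n in (5, 20, 100, 500, 3000):
    a, b = fit_line(train_x[:n], train_y[:n])
    learning_curve.append((n, mse(a, b, test_x, test_y)))

for n, err in learning_curve:
    print(f"train size {n:>4}: test MSE {err:.3f}")
```

If the test error has stopped improving well before you reach your full training set, scraping more data is unlikely to help that particular model much. If it is still dropping, more data (or a simpler model) is worth considering.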

Unfortunately, a simple model often has less predictive power than a complex one. In our example, the single-variable linear regression will tend to yield predictions farther from the actual value than a neural network when the target depends on more than that single input.

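As for the train/test split, I recommend randomly shuffling all the data, then using 80% for training and 20% for testing. To check that your model is a good fit regardless of which samples end up in the training set, use K-Fold Cross Validation: divide the shuffled data into k equal folds and train k times, each time holding out a different fold for testing (k = 5 gives an 80/20 split on every round).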
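Here is a minimal sketch of both ideas using only the standard library, sized for the 3507 samples mentioned in the question. The helper names `split_indices` and `k_fold_indices` are made up for illustration. In practice you would more likely reach for scikit-learn's `train_test_split` and `KFold`, which do the same job.

```python
import random

def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle indices 0..n_samples-1 and return (train, test) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Single 80/20 split of the 3507 cleaned samples.
train_idx, test_idx = split_indices(3507)
print(len(train_idx), "train /", len(test_idx), "test")

# 5-fold cross-validation: five rounds, each holding out a different fifth.
for fold, (train, test) in enumerate(k_fold_indices(3507, k=5)):
    print(f"fold {fold}: {len(train)} train, {len(test)} test")
```

Fit and score the model once per fold, then average the scores. A large spread between folds suggests the result depends heavily on which samples were used for training.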
