I collected 8800 samples total, but after data cleaning and outlier detection I was left with 3507 samples.

Is this enough to train machine learning models (lasso, linear regression, decision tree), or should I scrape more?

I expect more data is needed, but I want to check with others before wasting time.

Also, how much data should I use for training and testing?

greybeard

1 Answer

When it comes to machine learning, more data is almost always better.

In general, as your model gets more complex, you'll need more data to prevent overfitting.

For example, a single-variable linear regression requires less data to train than a convolutional neural network. This is because the neural network has more weights than the single-variable model.

Unfortunately, a simple model can capture less structure than a complex one. In our example, if the target depends on more than the single input, the linear regression cannot represent that dependence, so its predictions will typically land farther from the actual values than those of a neural network trained on enough data.
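One way to see how the data requirement changes with model complexity is a learning curve: train on growing fractions of the data and watch the validation score. The sketch below assumes scikit-learn is available and uses synthetic data as a stand-in for the 3507 cleaned samples; the five features, the coefficients, and the `alpha` value are made up for illustration. If the validation score is still climbing at the largest training size, more data would probably help; if it has flattened, scraping more is unlikely to change much for that model.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, learning_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the cleaned dataset (assumption: 5 numeric features).
rng = np.random.default_rng(0)
X = rng.normal(size=(3507, 5))
y = X @ np.array([1.5, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=3507)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
fractions = np.linspace(0.1, 1.0, 5)  # 10% .. 100% of each training fold

models = {
    "lasso": Lasso(alpha=0.01),                      # simple, few parameters
    "tree": DecisionTreeRegressor(random_state=0),   # unpruned, high capacity
}

curves = {}
for name, model in models.items():
    sizes, train_scores, val_scores = learning_curve(
        model, X, y, train_sizes=fractions, cv=cv, scoring="r2"
    )
    # Average the R^2 scores over the 5 folds at each training size.
    curves[name] = (sizes, train_scores.mean(axis=1), val_scores.mean(axis=1))
    for n, tr, va in zip(*curves[name]):
        print(f"{name:5s} n={n:4d} train R2={tr:.3f} validation R2={va:.3f}")
```

A large gap between training and validation scores that narrows as the training size grows is the usual sign that a model is overfitting and would benefit from more data.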

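As for the train/test split, I recommend randomly shuffling all the data and then using 80% for training and 20% for testing. Repeating this with a fresh random split each time, to check whether your model is a good fit regardless of which samples were selected for training, is called repeated random sub-sampling (or Monte Carlo cross-validation). K-fold cross-validation is the closely related and more common variant: the data is divided into k folds of roughly equal size, and each fold serves as the test set exactly once while the remaining k-1 folds are used for training. With k = 5, every round is an 80/20 split.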
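A minimal sketch of a shuffled 80/20 split and of 5-fold cross-validation, assuming scikit-learn. The data is synthetic and only mimics the size of the cleaned dataset; the feature count and coefficients are illustrative assumptions, not taken from the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in for the cleaned dataset (assumption: 5 numeric features).
rng = np.random.default_rng(0)
X = rng.normal(size=(3507, 5))
y = X @ np.array([1.5, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=3507)

# Single shuffled 80/20 hold-out split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
holdout_r2 = model.score(X_test, y_test)  # R^2 on the held-out 20%
print(f"hold-out R2: {holdout_r2:.3f}")

# 5-fold cross-validation: every sample is used for testing exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
print(f"5-fold R2: mean={cv_scores.mean():.3f} std={cv_scores.std():.3f}")
```

A small standard deviation across folds suggests the fit does not depend much on which samples end up in the training set.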

davetherock