I’m trying to replicate some basic linear regression results from scikit-learn’s LASSO implementation with various PyTorch linear and non-linear models (dense & LSTM), and I’m finding that the solution quality is nowhere near as good. I’ve tried this two separate ways. First, by explicitly adding an L1 penalty to the loss function, similar to what is described here: python - Pytorch: how to add L1 regularizer to activations? - Stack Overflow
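For reference, this is roughly what the first approach looks like — a minimal sketch assuming a plain `nn.Linear` model, made-up data, and an assumed penalty strength `lam`:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(100, 5)   # toy data standing in for my real features
y = torch.randn(100, 1)

model = nn.Linear(5, 1)   # stand-in for the dense/LSTM models
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()
lam = 1e-3                # assumed L1 strength

for _ in range(100):
    optimizer.zero_grad()
    pred = model(X)
    # explicit L1 penalty on the weights, added to the MSE loss
    l1 = sum(p.abs().sum() for p in model.parameters())
    loss = criterion(pred, y) + lam * l1
    loss.backward()
    optimizer.step()
```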
The second way I’ve tried is using AdamW and effectively treating its weight decay the same way lambda is used in LASSO (creating curves of training and cross-validation error as a function of lambda).
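A sketch of that sweep, again with toy data and an assumed lambda grid (note that AdamW’s `weight_decay` is decoupled L2-style decay, not an L1 penalty, so it is not an exact analogue of LASSO’s lambda):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X_train, y_train = torch.randn(80, 5), torch.randn(80, 1)
X_val, y_val = torch.randn(20, 5), torch.randn(20, 1)
criterion = nn.MSELoss()

val_errors = {}
for wd in [1e-4, 1e-3, 1e-2, 1e-1]:   # assumed lambda (weight decay) grid
    model = nn.Linear(5, 1)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=wd)
    for _ in range(100):
        opt.zero_grad()
        loss = criterion(model(X_train), y_train)
        loss.backward()
        opt.step()
    # record validation error for this lambda to build the curve
    with torch.no_grad():
        val_errors[wd] = criterion(model(X_val), y_val).item()
```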
In both cases, I get sub-optimal results vs. scikit-learn (i.e., the validation and test set objective functions are higher). I also get a substantial amount of noise during training, even with a low learning rate (1.0e-4) and a fairly high weight decay.
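The scikit-learn baseline I’m comparing against is essentially this (synthetic data and an assumed `alpha` here, just to show the setup):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
# sparse ground-truth coefficients plus noise, as a toy analogue of my data
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -2.0]) + 0.1 * rng.standard_normal(100)

lasso = Lasso(alpha=0.1)   # alpha plays the role of lambda
lasso.fit(X, y)
train_mse = np.mean((lasso.predict(X) - y) ** 2)
```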
Any thoughts on how to get the neural network models closer to the linear regression results and smooth out the optimizer solutions?