
I am new to data science and want to explore the relationships in my data. I have a very large dataset of 556784 rows and 60 columns. Some of the variables are unwanted and should be left out before the data is fed to a neural network. Linear regression and multiple regression can help find the relationship between the input variables (X) and the target (Y). But does running a regression on such a huge dataset really help? Or are there other ways to find which variables are really important to the problem and which are not?

I know this is a theory question, but an answer would really help me proceed. Thanks!

Madhi

1 Answer


I'm also a noob in DS, but I think I can give you some ideas:

  • How you treat your data depends on what kind of data you are working with (numbers, text, or some kind of time series).
  • It is a good idea to explore it yourself by making some plots.
  • You can work on a reasonably small subset of your data to reduce computation time.
  • Do you really need a NN? Its results are often hard to interpret and it takes time to train. Maybe start with "classic" models first and do some good feature engineering.
  • Finally, you can check the data preprocessing chapter of the sklearn manual (which I find really good); I think it will give you some ideas to try:

http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
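
To make the ideas above a bit more concrete, here is a rough sketch of how one might take a subset and rank features with "classic" tools. Everything in it is an assumption for illustration: the data is synthetic, the column names `x0`..`x5` and `y` are made up, and the random forest is just one possible choice that I picked myself. Treat it as a starting point, not as the right method for your dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (hypothetical column names).
rng = np.random.default_rng(0)
n_rows = 20_000
df = pd.DataFrame(rng.normal(size=(n_rows, 6)),
                  columns=[f"x{i}" for i in range(6)])
# By construction only x0 and x1 drive the target; x2..x5 are pure noise.
df["y"] = 3.0 * df["x0"] - 2.0 * df["x1"] + rng.normal(scale=0.5, size=n_rows)

# 1. Work on a random subset to keep computation cheap.
sample = df.sample(n=5_000, random_state=0)
X = sample.drop(columns="y")
y = sample["y"]

# 2. Quick look: absolute correlation of each feature with the target.
corr = X.corrwith(y).abs().sort_values(ascending=False)

# 3. Linear regression on standardized features, so that the
#    coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)
linear = LinearRegression().fit(X_scaled, y)
coef = pd.Series(np.abs(linear.coef_), index=X.columns)
coef = coef.sort_values(ascending=False)

# 4. Tree-based importance, which can also pick up non-linear effects.
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importance = pd.Series(forest.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)

print(corr, coef, importance, sep="\n\n")
```

If the three rankings agree on which columns matter, that is a reasonable hint about which variables to keep. If they disagree, it may point to non-linear relationships or correlated features, which are worth plotting before deciding.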

I hope some of this will be helpful.

  • You can probably get more help if you share a data example. In case you decide to do that: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – George Vdovychenko Dec 29 '17 at 11:08