
I recently got a position to work as a data scientist on ML. My questions are as follows: is it possible to train an algorithm directly from a MySQL database, and is the process similar to training from a CSV file? Moreover, I am working on a very unbalanced dataset. When you use, for instance, 20 percent of the data for testing, does the split divide the negative and positive cases between training and testing in equal proportion? Can anyone suggest a good tutorial or documentation?

abraham foto
  • Downvoting since this question has multiple unrelated parts, and the last one (request for a tutorial or documentation) is both opinion-based and unclear; it's not even clear what you're asking for a tutorial about. – Silenced Temporarily Apr 04 '18 at 18:47

1 Answer


Sure, you can train your model directly from the database; this is what happens in production systems all the time. Your software should be designed so that it does not matter whether your data source is SQL, CSV or anything else. As you don't mention the programming language, it is hard to say how to do it, but in Python you can take a look here: How do I connect to a MySQL Database in Python?
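As a minimal sketch of the "data source should not matter" idea: once you load a query result into a pandas DataFrame, everything downstream is identical to the CSV case. The example below uses Python's built-in sqlite3 as a stand-in so it runs anywhere; for MySQL you would swap in a real connection (e.g. via mysql-connector-python or SQLAlchemy, shown in the comment) and keep the rest unchanged. The table and column names here are made up for illustration.

```python
import sqlite3  # stand-in for a MySQL connection so the sketch is runnable

import pandas as pd

# With MySQL you would instead create the connection roughly like:
#   import mysql.connector
#   conn = mysql.connector.connect(host="...", user="...",
#                                  password="...", database="...")
conn = sqlite3.connect(":memory:")

# Hypothetical training table with one feature column and one label column.
conn.execute("CREATE TABLE samples (feature REAL, label INTEGER)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [(0.1, 0), (0.9, 1), (0.5, 0)],
)

# From here on the pipeline is the same as with pd.read_csv(...):
# both return a DataFrame you can feed to your model.
df = pd.read_sql("SELECT feature, label FROM samples", conn)
X, y = df[["feature"]], df["label"]
```

For datasets too large to fit in memory, `pd.read_sql` also accepts a `chunksize` argument so you can iterate over the table in batches.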

If your data set is unbalanced, as is often the case in reality, you can use class weights to make your classifier aware of that. E.g. in Keras/scikit-learn you can just pass the class_weight parameter. Be aware that if your data set is too small, you can run into problems with default metrics like accuracy. Better take a look at the confusion matrix or other metrics like the Matthews correlation coefficient.

Another good reference: How does the class_weight parameter in scikit-learn work?
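A small sketch putting these pieces together in scikit-learn: `class_weight="balanced"` reweights samples inversely to class frequency, `stratify=y` keeps the positive/negative proportions equal in the train and test splits (which addresses the 0.2-test-split part of the question), and the confusion matrix and Matthews correlation coefficient give a more honest picture than accuracy. The toy data here is synthetic, purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, matthews_corrcoef
from sklearn.model_selection import train_test_split

# Toy imbalanced data: roughly 90% negatives, 10% positives.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# stratify=y preserves the class ratio in both splits; without it,
# a plain random 80/20 split only matches the ratio in expectation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# class_weight="balanced" makes the minority class count more in the loss.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

print(confusion_matrix(y_test, pred))   # rows: true class, cols: predicted
print(matthews_corrcoef(y_test, pred))  # 1.0 perfect, 0.0 no better than chance
```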

ixeption
  • Thanks, you understood my question correctly and your answer helps a lot; I am using Python. The dataset is too big to be handled as a CSV file (2 - 5 GB). What about the data transformation? I have a dataset with 31 columns, and 25 of them are string or object types. I want to encode all of them with sklearn's LabelEncoder and then with OneHotEncoder, but it complains about the input shape (921178, 25), which I think is too big. Is there a different way of transforming the attributes, or any suggestions? Thanks in advance! – abraham foto Apr 06 '18 at 10:53
  • What LabelEncoder and OneHotEncoder do is give every string encountered in the data a unique id. If you are dealing with more than single words in the data, you will basically end up with a huge dimension, which also does not help in terms of classification. So if you are dealing with text, you need to transform your text data using NLP techniques. Take a look at [BagOfWords](http://scikit-learn.org/stable/modules/feature_extraction.html#the-bag-of-words-representation) as a first idea. – ixeption Apr 06 '18 at 12:37