0

I have a .csv file that contains my data. I would like to do Logistic Regression, Naive Bayes and Decision Trees. I already know how to implement these.

However, my teacher wants me to split the data in my .csv file so that my algorithms train on 80% of it and predict the other 20%. I would like to know how to actually split the data that way.

diabetes_df = pd.read_csv("diabetes.csv")
diabetes_df.head()

with open("diabetes.csv", "rb") as f:
    data = f.read().split()
    train_data = data[:80]
    test_data = data[20:]

I tried to split it like this, but I'm sure it isn't working.

Martin Thoma
  • Please mention the type of data contained in the CSV – Rachit kapadia Apr 26 '18 at 10:07
  • Look here -> http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html – BcK Apr 26 '18 at 10:09
  • The file contains data to predict the outcome of diabetes based on the following features: number of pregnancies, glucose, blood pressure, insulin, BMI,... It's all numerical data and it's labeled. There are 2 possible outcomes (1 = has diabetes and 0 = does not have diabetes). – Georges Ridgmont Apr 26 '18 at 10:09
  • Do you want the first 80% of the lines to go into the training set and the last 20% into the test set, or do you need/accept a more even distribution? Anyway, that is a pure text line processing question... – Serge Ballesta Apr 26 '18 at 10:25

2 Answers

4

Workflow

  1. Load the data (see "How do I read and write CSV files with Python?")
  2. Preprocess the data (e.g. filtering / creating new features)
  3. Make the train-test (validation and dev-set) split

Code

scikit-learn's sklearn.model_selection.train_test_split is what you are looking for:

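from sklearn.model_selection import train_test_split

# X holds the feature columns and y the labels; both must be defined beforehand.
# test_size=0.2 holds out 20% of the rows for testing (an 80/20 split);
# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)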
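
For the 80/20 split asked about in the question, a minimal end-to-end sketch might look like the following. The small DataFrame is made-up data standing in for pd.read_csv("diabetes.csv"), and the column names Glucose, BMI and Outcome are assumptions, not taken from the actual file. The stratify=y argument is optional; it asks train_test_split to keep the ratio of the two classes similar in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows standing in for pd.read_csv("diabetes.csv").
# The column names are assumptions; adjust them to the real file.
diabetes_df = pd.DataFrame({
    "Glucose": [100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
    "BMI": [25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0],
    "Outcome": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

X = diabetes_df.drop(columns="Outcome")  # feature columns
y = diabetes_df["Outcome"]               # label column (1 = diabetes, 0 = no diabetes)

# test_size=0.2 holds out 20% of the rows; stratify=y keeps the
# class ratio similar in the training and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

print(len(X_train), len(X_test))  # 8 2
```

The classifiers would then be fitted on X_train and y_train only, and their predictions for X_test compared against y_test.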
Martin Thoma
-1
# Split one CSV line into its values at each comma
splitted_csv = "value1,value2,value3".split(',')
print(str(splitted_csv))  # ['value1', 'value2', 'value3']
print(splitted_csv[0])  # value1
print(splitted_csv[1])  # value2
print(splitted_csv[2])  # value3

There are also libraries that parse CSV and let you access a value by column name, but from your example I thought that you needed a "low-level" way to do it.
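
As one example of access by column name, Python's built-in csv module provides csv.DictReader, which uses the header row as dictionary keys. The sketch below is only an illustration: the CSV text is made up, the column names are assumptions, and the 80/20 cut after a seeded shuffle is one simple way to split the rows without extra libraries.

```python
import csv
import io
import random

# Made-up CSV text standing in for diabetes.csv; column names are assumptions.
csv_text = "Glucose,BMI,Outcome\n" + "\n".join(
    f"{100 + i},{25.0 + i},{i % 2}" for i in range(10))

# DictReader uses the header row as keys, so each value is
# accessible by column name (all values are read as strings).
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["Glucose"])  # 100

# Shuffle with a fixed seed for reproducibility, then cut at 80% of the rows.
random.Random(0).shuffle(rows)
cut = int(len(rows) * 0.8)
train_rows, test_rows = rows[:cut], rows[cut:]
print(len(train_rows), len(test_rows))  # 8 2
```

For a real file, io.StringIO(csv_text) would be replaced by a file object opened with open("diabetes.csv", newline="").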

andrexorg