-2

I have a dataset with 10000 samples, each labeled with one of 4 classes (0, 1, 2, 3).

>>> data.shape
(10000, 250)
>>> label.shape
(10000,)

I wonder whether there is an API that can shuffle the data and split it into training and test sets.

For example:

(training_data, training_label, test_data, test_label) = split_shuffle(data, label, 80) # 80 means 80% training, 20% test

What is the most efficient way to do this?

Furthermore, what if I want the data split for (plain) 5-fold cross-validation?

  • 3
    Possible duplicate of [Split inputs into training and test sets](https://stackoverflow.com/questions/41859605/split-inputs-into-training-and-test-sets) – Seljuk Gulcan Mar 14 '18 at 13:10
  • 1
    why is this downvoted? The fact that it is duplicated does not mean the question is irrelevant, right? – famargar Mar 14 '18 at 13:30
  • @famargar Requesting an API is roughly equivalent to a request for an off-site resource, and therefore off-topic. The question also doesn't seem to show an actual attempt at solving the problem. – E_net4 Mar 14 '18 at 13:40
  • @E_net4 OK. This is a user with 28 points. I think people here have to realise that this community isn't exactly welcoming for newcomers. Most of the time one could simply suggest/help the OP rephrase the question before jumping on the downvote button. – famargar Mar 14 '18 at 13:56
  • 2
    @famargar: maybe, though that debate has been done to death on _Meta_! The difficulty we have here is the gap between the purpose of the site (collecting questions that will be useful for future readers) and the purpose of each question author (asking about something that will only be useful to them). – halfer Mar 14 '18 at 14:00
  • I hurried and made a mistake by posting an answer to this question. It was upvoted and marked as an accepted answer (+35 rep for me). But since it has been [asked earlier](https://stackoverflow.com/q/41859605/2099607) on SO, and with a simple search query (i.e. "python train test split" or "python cross validation") in a search engine, the OP could find the answer to his/her questions, I would delete my answer to respect the rules of SO and in hope of more high quality questions/answers.... Oops, it seems that I can't delete it since it was accepted by the OP. Would you care to undo it? – today Mar 14 '18 at 14:02
  • @abe-wong thanks for undoing it. I deleted my answer. – today Mar 14 '18 at 14:42
  • @abe-wong. Just a note to say that sklearn's `cross_val_score` doesn't shuffle by default. You will need to pass something like `cv=StratifiedKFold(shuffle=True)` to it if you want to shuffle, for example because your data is sorted by class or your classes are otherwise not evenly distributed through the dataset. – Stev Mar 14 '18 at 16:07
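
The point in the last comment about shuffled 5-fold cross-validation can be sketched as follows. This is only an illustration under assumptions: the arrays are small synthetic stand-ins for `data` and `label`, and `LogisticRegression` is an arbitrary classifier chosen so that `cross_val_score` has something to fit; neither comes from the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-ins for the question's arrays (assumed, much smaller).
rng = np.random.default_rng(0)
data = rng.normal(size=(400, 10))
label = np.sort(rng.integers(0, 4, size=400))  # deliberately sorted by class

# shuffle=True mixes the samples before the folds are formed;
# stratification keeps the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Option 1: get the index arrays of each fold and slice the data yourself.
folds = []
for train_idx, test_idx in cv.split(data, label):
    folds.append((data[train_idx], label[train_idx],
                  data[test_idx], label[test_idx]))

# Option 2: let cross_val_score run the 5 fits (classifier is an assumption).
scores = cross_val_score(LogisticRegression(max_iter=1000), data, label, cv=cv)
print(len(folds), scores.shape)
```

Because `random_state` is fixed, both options see the same five folds.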

2 Answers

3

scikit-learn's `train_test_split` is what you're looking for. It shuffles the samples by default and can be used as follows:

from sklearn.model_selection import train_test_split
# X is the feature array and y the label array; test_size=0.33 holds out 33% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
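
Adapted to the shapes in the question, an 80/20 split might look like the sketch below. The arrays are synthetic stand-ins (an assumption, since the real data is not shown), and `stratify=label` is an optional extra not mentioned in the answer; it keeps the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins with the shapes from the question (assumed data).
rng = np.random.default_rng(0)
data = rng.normal(size=(10000, 250))
label = rng.integers(0, 4, size=10000)

# test_size=0.2 gives 80% training / 20% test; shuffle=True is the default.
# Note the return order: both data parts first, then both label parts.
training_data, test_data, training_label, test_label = train_test_split(
    data, label, test_size=0.2, random_state=42, stratify=label
)

print(training_data.shape, test_data.shape)
```

Fixing `random_state` makes the shuffle reproducible; drop it to get a different split on every run.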
J. Doe
KonstantinosKokos
0

If you want to shuffle your data, you can use either `numpy.random.shuffle` (for NumPy arrays) or `df.sample(frac=1)` (for a pandas DataFrame). For splitting, see KonstantinosKokos's answer, or experiment with `np.split`.
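
A NumPy-only version of this shuffle-then-split idea could look like the sketch below. The arrays are synthetic stand-ins for the question's `data` and `label` (an assumption). It uses a single index permutation instead of calling `numpy.random.shuffle` on each array separately, because shuffling the two arrays independently would break the pairing between rows and labels.

```python
import numpy as np

# Synthetic stand-ins with the shapes from the question (assumed data).
rng = np.random.default_rng(0)
data = rng.normal(size=(10000, 250))
label = rng.integers(0, 4, size=10000)

# One shared permutation keeps every row aligned with its label.
perm = rng.permutation(len(data))
data_shuffled, label_shuffled = data[perm], label[perm]

# Index of the 80% boundary; np.split cuts each array at that index.
cut = len(data) * 80 // 100
training_data, test_data = np.split(data_shuffled, [cut])
training_label, test_label = np.split(label_shuffled, [cut])

print(training_data.shape, test_data.shape)
```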

rpanai