I have a question regarding a command. In class we used runif to create a training set that should contain 50% of the data set (we then built a decision tree on this training set). But I still can't understand the logic behind this command. Could someone explain how it works?

I understand decision trees and the logic behind splitting up a data set; my question is specifically about how this command works.

inTrain <- runif(nrow(USArrests)) < 0.5
desertnaut
    Does this answer your question? [How to split data into training/testing sets using sample function](https://stackoverflow.com/questions/17200114/how-to-split-data-into-training-testing-sets-using-sample-function) – Rémi Coulaud Jul 08 '20 at 17:58

1 Answer

You have a dataset named USArrests with nrow(USArrests) rows; for simplicity, let's say 100. Then runif(nrow(USArrests)) creates 100 uniformly distributed random numbers between 0 and 1, i.e. one number for every row in your dataset.

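Next, the expression runif(nrow(USArrests)) < 0.5 checks whether each number is below 0.5, returning TRUE or FALSE. This gives you a logical vector of length 100 (or nrow(USArrests)) that indicates whether a row belongs to the training set or the test set. Since each number falls below 0.5 with probability 0.5, roughly half of the rows, though not necessarily exactly 50%, end up in the training set.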
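To make the two steps concrete, here is a minimal sketch in R. The set.seed(42) call and the intermediate variable u are my additions for reproducibility and readability; they are not part of the original command.

```r
# Seed chosen arbitrarily so the draw is reproducible (an assumption, not in the original)
set.seed(42)

# Step 1: one uniform random number between 0 and 1 per row of the dataset
u <- runif(nrow(USArrests))
length(u)        # same as nrow(USArrests)

# Step 2: compare each number with 0.5 to get a logical vector
inTrain <- u < 0.5
head(inTrain)    # a mix of TRUE and FALSE
sum(inTrain)     # number of rows flagged for training
mean(inTrain)    # share of rows flagged; close to 0.5, but rarely exactly 0.5
```

Running this a few times without the seed shows that the number of TRUE values changes from run to run.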

It's not shown in your code, but finally you select your training data with

USArrests[inTrain,]

and your test data by negating the logical vector:

USArrests[!inTrain,]
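Putting the pieces together, a sketch of the full split might look like the following. The names train, test, idx, train2 and test2, as well as the seed, are illustrative choices of mine. Because each row is assigned independently, the training set only contains about half of the rows; if an exact 50% split is needed, sample() is a common alternative, shown at the end.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

inTrain <- runif(nrow(USArrests)) < 0.5

train <- USArrests[inTrain, ]    # rows where inTrain is TRUE
test  <- USArrests[!inTrain, ]   # logical negation selects the remaining rows

# Every row ends up in exactly one of the two sets
nrow(train) + nrow(test) == nrow(USArrests)

# Caveat: a minus sign on a logical vector coerces TRUE/FALSE to -1/0,
# so -inTrain does not select the complement of inTrain

# Alternative with an exact split: draw half of the row indices without replacement
idx    <- sample(nrow(USArrests), size = floor(nrow(USArrests) / 2))
train2 <- USArrests[idx, ]
test2  <- USArrests[-idx, ]      # negative integer indices drop the sampled rows
```

The sample() variant works with integer row positions, which is why the minus sign is the right tool there, while the runif() variant works with a logical vector and needs ! instead.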
Martin Gal