
I have a dataset, loaded like this: `mushrooms <- read.csv("mushrooms.csv")`. I also already have a `mushrooms.training_set`, which is 1/3 of the whole dataset. For both variables, `typeof()` returns `list`.

Now I want to select the rows in the original dataset `mushrooms` that are not in `mushrooms.training_set`. How would I do this? I have tried the following:

  • `mushrooms[c(!mushrooms.training_set),]`, but this returns something in the order of 64K rows.
  • `mushrooms[!mushrooms.training_set,]`
  • `mushrooms[!duplicated(mushrooms.training_set)]`

Can anyone help me out?

Ben Bolker
jbehrens94
  • `typeof()` is of very limited use because of how general it is. `class()` or especially `str()` tell you much more useful information. Almost certainly your data is a `data.frame`, which is much more useful information than saying it's a `list`. Data frames are a very special kind of list. – Gregor Thomas Jan 26 '18 at 14:43
  • can you please show us something like `str(mushrooms)`, `str(mushrooms.training_set)` ? – Ben Bolker Jan 26 '18 at 14:45
  • You are right, @Gregor, they are both data.frames – jbehrens94 Jan 26 '18 at 14:45
  • I've been seeing `typeof()` in newbie questions more frequently lately. Is there some blog post or course telling people to use `typeof`? Something I could comment on or send an email to the author to tell them to stop giving bad advice? – Gregor Thomas Jan 26 '18 at 14:46
  • I'm not sure where I got it from, I'll retrace my steps later. And I'll let you know :) – jbehrens94 Jan 26 '18 at 14:49
  • https://stackoverflow.com/questions/12693908/get-type-of-all-variables This question for example. – jbehrens94 Jan 26 '18 at 14:50
  • I suggest that, rather than proceeding like this, you have a look at the `caret` package, which contains a framework for splitting data into training and test sets. Have a look at this example for instance: https://stackoverflow.com/a/13575580/1527403 – Stephen Henderson Jan 26 '18 at 14:51
  • @jbehrens94 yes, but that's an old question and doesn't seem like the thing too many people would stumble on. I feel like I've seen 4-5 questions using `typeof` in the last few months, and maybe 1 in the previous 3 years. – Gregor Thomas Jan 26 '18 at 14:53
  • It's the one I actually found when typing 'R type of variable', this was the first hit. – jbehrens94 Jan 26 '18 at 14:55

1 Answer


From where you are in the question, you can use `dplyr::setdiff`:

library(dplyr)
# Rows of mushrooms that do not appear in the training set
mushrooms.test <- setdiff(mushrooms, mushrooms.training_set)

But most of the time it's easier to create the test set at the same time as the training set. There are lots of examples at How to split data into training and test sets?
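As a minimal sketch of creating both sets at once, the base R snippet below samples row indices and uses them to build the two data frames. The `toy` data frame, its columns, and the names `train_idx`, `toy.training_set` and `toy.test_set` are hypothetical stand-ins for the mushrooms data, and the 1/3 training fraction is taken from the question; treat this as one possible approach, not the method from the linked examples.

```r
# Toy stand-in for the mushrooms data (hypothetical columns)
set.seed(42)
toy <- data.frame(
  id  = 1:12,
  cap = rep(c("flat", "bell", "convex"), times = 4)
)

# Sample one third of the row indices for the training set
train_idx <- sample(nrow(toy), size = floor(nrow(toy) / 3))

# Positive indices keep the sampled rows; negative indices drop them
toy.training_set <- toy[train_idx, ]
toy.test_set     <- toy[-train_idx, ]
```

Because the split is done by row index, every row lands in exactly one of the two sets, including rows whose values happen to be identical to another row (set operations on data frames generally collapse such duplicates).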

Gregor Thomas