Difference between two lists to create a dataset

Question

I have a dataset, loaded like this: `mushrooms <- read.csv("mushrooms.csv")`, and I already have a `mushrooms.training_set`, which is 1/3 of the whole dataset. For both variables, `typeof()` returns `list`.

Now I want to select the rows in the original dataset `mushrooms` that are not in `mushrooms.training_set`. How would I do this? I have tried the following:

`mushrooms[c(!mushrooms.training_set),]`, but this returns something on the order of 64K rows.
`mushrooms[!mushrooms.training_set,]`
`mushrooms[!duplicated(mushrooms.training_set)]`

Can anyone help me out?

`typeof()` is of very limited use because of how general it is. `class()` or especially `str()` tell you much more useful information. Almost certainly your data is a `data.frame`, which is much more useful information than saying it's a `list`. Data frames are a very special kind of list. — Gregor Thomas, Jan 26 '18 at 14:43
can you please show us something like `str(mushrooms)`, `str(mushrooms.training_set)` ? — Ben Bolker, Jan 26 '18 at 14:45
I've been seeing `typeof()` in newbie questions more frequently lately. Is there some blog post or course telling people to use `typeof`? Something I could comment on or send an email to the author to tell them to stop giving bad advice? — Gregor Thomas, Jan 26 '18 at 14:46
I'm not sure where I got it from, I'll retrace my steps later. And I'll let you know :) — jbehrens94, Jan 26 '18 at 14:49
https://stackoverflow.com/questions/12693908/get-type-of-all-variables This question for example. — jbehrens94, Jan 26 '18 at 14:50
Rather than proceeding like this, I suggest you have a look at the `caret` package, which contains a framework for splitting data into training and test sets. Have a look at this example, for instance: https://stackoverflow.com/a/13575580/1527403 — Stephen Henderson, Jan 26 '18 at 14:51
@jbehrens94 yes, but that's an old question and doesn't seem like the thing too many people would stumble on. I feel like I've seen 4-5 questions using `typeof` in the last few months, and maybe 1 in the previous 3 years. — Gregor Thomas, Jan 26 '18 at 14:53
It's the one I actually found when typing 'R type of variable', this was the first hit. — jbehrens94, Jan 26 '18 at 14:55

Gregor Thomas · Accepted Answer · 2018-01-26T14:56:17.640

5

From where you are in the question, you can use dplyr::setdiff:

library(dplyr)
mushrooms.test = setdiff(mushrooms, mushrooms.training_set)

But most of the time it's easier to create the test set at the same time as the training set. There are lots of examples at How to split data into training and test sets?
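
As a minimal sketch of creating both sets in one step, the base R snippet below samples row indices once and uses them for both subsets. The toy data frame and its columns are hypothetical stand-ins for the real `mushrooms` data, and `mushrooms.test_set` is just an illustrative name; the 1/3 training proportion is taken from the question.

```r
# Toy stand-in for the mushrooms data (hypothetical columns).
mushrooms <- data.frame(
  id = 1:9,
  class = rep(c("e", "p", "e"), times = 3)
)

set.seed(42)  # make the random split reproducible

# Draw row indices for the training set: one third of the rows.
train_idx <- sample(nrow(mushrooms), size = floor(nrow(mushrooms) / 3))

# Positive indices select the training rows; negative indices select the rest.
mushrooms.training_set <- mushrooms[train_idx, ]
mushrooms.test_set <- mushrooms[-train_idx, ]
```

Because both subsets come from the same index vector, they cannot overlap and together cover every row, so no set difference between data frames is needed afterwards.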

edited Jan 26 '18 at 14:56

answered Jan 26 '18 at 14:45

Gregor Thomas


huh. Does `setdiff` work on an entire data frame? – Ben Bolker Jan 26 '18 at 14:47
I just tried this code, but I end up with the same dataset as `mushrooms`? – jbehrens94 Jan 26 '18 at 14:47
Oops, no it doesn't. Really thought it did. – Gregor Thomas Jan 26 '18 at 14:48
Ah, `dplyr::setdiff` does work on entire data frames. – Gregor Thomas Jan 26 '18 at 14:51
+1 for the last two sentences, which answer the OP's underlying problem. However, the OP's literal question (difference between two data.frames) is [clearly a dupe](https://stackoverflow.com/questions/28702960/find-complement-of-a-data-frame-anti-join) and a common one at that – C8H10N4O2 Jan 26 '18 at 14:59
I actually am doing that, but I just think I wanted to outsmart myself unknowingly, haha. – jbehrens94 Jan 26 '18 at 15:03
