dplyr: Split data_frame into two randomly

Question

How can I split a data_frame randomly into two without creating an index? sample_n works for me to get one part of it, but how can I collect the other part?

You can do an `anti_join` with the extracted part as `y`-dataframe and the original as `x`-dataframe. — Jaap, Sep 19 '15 at 17:42
@Jaap I was trying something similar with `filter`, (not) `%in%` and `row.names`. Let me try your suggestion. — tchakravarty, Sep 19 '15 at 17:47
I'm sort of curious how much efficiency this really buys you over having an index: how big are your data sets? — Ben Bolker, Sep 19 '15 at 17:55
(e.g. `samp <- sample(nrow(df),size=10); dfx <- df[samp,]; dfy <- df[-samp,]`) (this will probably be better than using `%in%` ...) — Ben Bolker, Sep 19 '15 at 18:02
If this wasn't tagged with [tag:dplyr], this should have been instantly closed with this http://stackoverflow.com/questions/17200114/how-to-split-data-into-training-testing-sets-using-sample-function-in-r-program. And next time please provide an MRE and do some Google searching. You are an experienced enough user to know that. — David Arenburg, Sep 19 '15 at 18:56

score 7 · Accepted Answer · answered Sep 19 '15 at 17:47

You can do an `anti_join` with the extracted part as `y`-dataframe and the original as `x`-dataframe. A small example:

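library(dplyr)

df <- data_frame(x=1:20,y=runif(20))
# draw 10 rows at random, without replacement
dfy <- df %>% sample_n(10, replace=FALSE)
# keep the rows of df whose x does not appear in dfy (x must be unique)
dfx <- anti_join(df, dfy, by="x")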

This results in the following dataframes. The original:

> df
Source: local data frame [20 x 2]

    x          y
1   1 0.64147504
2   2 0.35766839
3   3 0.44875782
4   4 0.01905876
5   5 0.85655599
6   6 0.88191481
7   7 0.46532067
8   8 0.09831802
9   9 0.31158184
10 10 0.39504048
11 11 0.81358862
12 12 0.41702158
13 13 0.80441008
14 14 0.69928890
15 15 0.19040897
16 16 0.94120853
17 17 0.65289448
18 18 0.46844427
19 19 0.63177479
20 20 0.58288923

The one half:

> dfx
Source: local data frame [10 x 2]

    x         y
1  19 0.6317748
2  17 0.6528945
3  16 0.9412085
4  15 0.1904090
5  14 0.6992889
6  11 0.8135886
7   7 0.4653207
8   6 0.8819148
9   5 0.8565560
10  3 0.4487578

The other half:

> dfy
Source: local data frame [10 x 2]

    x          y
1  18 0.46844427
2   8 0.09831802
3  12 0.41702158
4   4 0.01905876
5   2 0.35766839
6  10 0.39504048
7  13 0.80441008
8   9 0.31158184
9   1 0.64147504
10 20 0.58288923
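
The `anti_join` above relies on `x` being unique. When the data has no such column, one option is to add a temporary row identifier first and join on that. The following is a minimal sketch under that assumption, not a definitive implementation: the column name `row_id`, the example data, and the fixed seed are all illustrative and not part of the original example.

```r
library(dplyr)

set.seed(42)  # assumption: fixed seed only so the split is reproducible

# Hypothetical data without a unique column: `g` repeats, `y` is random
df <- data.frame(g = rep(c("a", "b"), each = 10), y = runif(20))

# Add a temporary row identifier (`row_id` is an illustrative name)
df_id <- df %>% mutate(row_id = row_number())

# Draw one half at random, then take all remaining rows via anti_join
dfy_id <- df_id %>% sample_n(10, replace = FALSE)
dfx_id <- anti_join(df_id, dfy_id, by = "row_id")

# Drop the helper column again
dfy <- dfy_id %>% select(-row_id)
dfx <- dfx_id %>% select(-row_id)
```

Joining on the helper column alone avoids accidentally dropping rows that merely share values with a sampled row.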

Answered by Jaap, Sep 19 '15 at 17:47

Any way that this can be done using `row.names` in case there isn't a unique index in the data? – tchakravarty Sep 19 '15 at 17:54
@fgnu Actually, `row.names` is a kind of unique index. If there are `row.names` present, then you should be able to subset the original dataframe by excluding the rownames of the sample (with e.g. `!` and `%in%`) – Jaap Sep 19 '15 at 18:04
Hadley hates row.names. Equivalently, row.names are a bad idea – bramtayl Sep 19 '15 at 18:05
@bramtayl I would tend to do what Jaap is suggesting here -- bring in row.names as a column in the data, and use that, in the absence of a variable like `x` as in the data above. – tchakravarty Sep 19 '15 at 18:11
@bramtayl The fact that Hadley Wickham hates `row.names` doesn't mean they are a bad idea. Hadley ≠ god. – Jaap Sep 19 '15 at 18:13
Be careful lest you get struck down with lightning – bramtayl Sep 19 '15 at 18:30
@DavidArenburg, I think your comment is inappropriate. Not flagging it (yet) ... I disagree with bramtayl's "equivalently" statement too, but the second ("lightning") comment is probably tongue-in-cheek. – Ben Bolker Sep 19 '15 at 18:49
@bramtayl To be clear: I respect Hadley for what he is doing for the R ecosystem. However, there are some people who follow everything he says blindly. My motto: always try to think critically. – Jaap Sep 19 '15 at 18:54
@DavidArenburg, fine, I'm flagging as "rude or offensive". But why not just delete it yourself? (In fact, why not be a little bit more polite to begin with?) To be clear, I'm not disagreeing with your or Jaap's opinions (in fact I share them to some extent) that Hadley is smart but not always right, just with the way you expressed yourself. – Ben Bolker Sep 19 '15 at 18:55
What does it have to do with row.names? – bramtayl Sep 19 '15 at 20:57
