Remove rows of a sample from the original data frame

Question

The data set I am using has missing values, so I used the Amelia package for imputation. The resulting data set has the following form:

Bi.Rads     Age        Shape      Margin      Density     Severity
5.000000     70.00000 3.4685058  5.00000000 3.000000        1
5.000000     70.00000 4.0000000  3.00000000 3.000000        1
5.000000     70.00000 4.0000000  4.00000000 3.000000        1
5.000000     70.00000 4.0000000  5.00000000 3.000000        1
5.000000     70.00000 4.2881664  4.00000000 3.689292        1
5.000000     70.27765 4.0000000  4.00000000 3.000000        1

The values with decimal places are the imputed ones. Treating this data set as a data frame df, I randomly sample 100 rows from df without replacement:

df1<-df[sample(nrow(df),100),]

Now I want to remove the rows of df1 from df. I have tried every suggestion in similar posts, such as using %in% and the dplyr package, but none of them returns 861 rows. The sqldf and compare packages have not worked either. I tried to comment on the other posts but couldn't because I don't have enough reputation. Could you please help me out?

Consider saving a vector of the row indices you want for df1, and then creating df2 from all rows except those in the vector: `keep <- sample(nrow(df), 100)`, then `df1 <- df[keep, ]` and `df2 <- df[-keep, ]` – mdgbeck Sep 27 '16 at 18:29
@AOK3000 I am using RStudio and I tried your suggestion. It says 861 observations in the environment window, but when I print it, it prints all 961 of them. Not sure if it's right. – Varun Sep 27 '16 at 18:41
When you print, you are seeing the original row names. The environment is accurate. You can double-check by calling `dim(df2)`. – mdgbeck Sep 27 '16 at 21:39
@AOK3000 could you answer another question? I am calculating this confusion matrix: `table(pred = svm.pred, true = testset[,10])`. I want to repeat the process 20 times: take a random sample of 10% of the records, then compute the confusion matrix. How can I take the mean of all the confusion matrices I calculate with this table function? I want to find the average accuracy of the SVM. – Varun Sep 27 '16 at 22:04
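
The follow-up question in the last comment is not answered in this thread. One possible approach, sketched here under assumptions, is to store each confusion matrix in a list, add them with `Reduce`, and average the per-run accuracies. The labels and predictions below are simulated placeholders (hypothetical names `truth_all`, `pred`); in real use `pred` would come from the fitted SVM. Fixing the factor levels keeps every table the same shape so the matrices can be added.

```r
set.seed(1)
n_reps  <- 20
classes <- c(0, 1)

# Hypothetical stand-in for the Severity column (961 records).
truth_all <- sample(classes, 961, replace = TRUE)

conf_list <- vector("list", n_reps)
acc       <- numeric(n_reps)

for (i in seq_len(n_reps)) {
  # Draw 10% of the records as the test set for this repetition.
  test_idx <- sample(length(truth_all), round(0.1 * length(truth_all)))
  truth    <- truth_all[test_idx]

  # Placeholder for the SVM predictions: flips roughly 20% of the labels.
  flip <- runif(length(truth)) < 0.2
  pred <- ifelse(flip, 1 - truth, truth)

  # Fixed levels guarantee a 2 x 2 table even if a class is absent.
  cm <- table(pred = factor(pred,  levels = classes),
              true = factor(truth, levels = classes))

  conf_list[[i]] <- cm
  acc[i]         <- sum(diag(cm)) / sum(cm)
}

mean_cm  <- Reduce(`+`, conf_list) / n_reps  # element-wise mean matrix
mean_acc <- mean(acc)                        # average accuracy
```

Averaging the accuracies directly is usually what "average accuracy" means; the mean matrix is mainly useful for seeing which kind of error dominates.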

score 1 · Accepted Answer · answered Sep 27 '16 at 18:32

Try this:

indices <- sample(1:nrow(df), 100)
df <- df[-indices,]
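
As a minimal sketch of the same idea that keeps both the sample and the remainder, the example below stores the sampled positions once and reuses them. The toy data frame is a hypothetical stand-in for the imputed data, with 961 rows as in the question.

```r
set.seed(42)

# Toy stand-in for the imputed data (values are made up).
df <- data.frame(Age   = round(runif(961, 20, 90), 5),
                 Shape = sample(1:4, 961, replace = TRUE))

keep <- sample(nrow(df), 100)  # integer row positions, without replacement
df1  <- df[keep, ]             # the 100 sampled rows
df2  <- df[-keep, ]            # everything except the sampled rows

dim(df1)  # 100 rows
dim(df2)  # 861 rows

# No row appears in both pieces.
length(intersect(rownames(df1), rownames(df2)))  # 0
```

Because the split is done by position, it does not depend on the values in the rows, so decimal values and duplicate rows cause no trouble.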

Sandipan Dey

I tried your suggestion and got this error: `random <- mammo[sample(1:nrow(mammo),100),]` followed by `mammo <- mammo[-random,]` gives `Error in xj[i] : invalid subscript type 'list'` – Varun Sep 27 '16 at 18:35
`random <- sample(1:nrow(mammo), 100); mammo <- mammo[-random,]` – Sandipan Dey Sep 27 '16 at 18:36
Yeah, I did this too and tried to print mammo. It prints all 961 of them, but shows the number of observations as 861. – Varun Sep 27 '16 at 18:46
Thanks. Even though it prints all 961 records, only 861 are considered. – Varun Sep 27 '16 at 23:22
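
Two points from these comments can be checked with a small sketch (the single-column `mammo` here is a hypothetical stand-in). First, a subset keeps the original row labels, so the printed labels can run up to 961 even though only 861 rows remain. Second, the reported error is consistent with `random` holding a data frame of sampled rows rather than integer positions.

```r
set.seed(7)
mammo <- data.frame(x = seq_len(961))

idx  <- sample(nrow(mammo), 100)
rest <- mammo[-idx, , drop = FALSE]

nrow(rest)                        # 861 rows remain
max(as.integer(rownames(rest)))   # original labels, may be as high as 961

# Renumber the rows if the old labels are confusing.
rownames(rest) <- NULL
max(as.integer(rownames(rest)))   # now 861

# Subscripting with a data frame instead of positions is expected to fail,
# as reported in the first comment; tryCatch keeps the script running.
bad <- mammo[idx, , drop = FALSE]  # a data frame, not a vector of positions
res <- tryCatch(mammo[-bad, ], error = function(e) conditionMessage(e))
```

Calling `nrow()` or `dim()` is a more reliable check than reading the printed row labels.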

score -2 · Answer 2 · edited May 23 '17 at 10:27

Here you go. The following is analogous to negating %in% when subsetting a single data frame, but here it is used to keep or drop whole rows.

library(dplyr)

Desired_data<-anti_join(df, df1)

Source:

Find complement of a data frame (anti - join)
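
One likely reason value-based matching does not return 861 rows is duplicate rows: `anti_join` and `%in%` drop every row whose values match a sampled row, including identical rows that were never sampled. The base R sketch below illustrates this on made-up data and shows one way around it, a hypothetical `id` column that identifies rows by position. The helper `key` is also an assumption introduced for illustration.

```r
# Toy data with a deliberate duplicate: rows 1 and 2 hold identical values.
df  <- data.frame(Age = c(70, 70, 65, 58), Shape = c(4, 4, 3, 1))
df1 <- df[1, ]  # a "sample" containing only row 1

# Value-based matching: build one string key per row and compare keys.
key <- function(d) do.call(paste, c(d, sep = "\r"))
by_value <- df[!(key(df) %in% key(df1)), ]
nrow(by_value)  # 2: row 2 is dropped as well, although it was not sampled

# Identity-based matching: tag each row before sampling, then match on the tag.
df$id <- seq_len(nrow(df))
df1   <- df[1, ]
by_id <- df[!(df$id %in% df1$id), ]
nrow(by_id)     # 3: only the sampled row is removed

# With dplyr, the same idea would be: dplyr::anti_join(df, df1, by = "id")
```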

answered Sep 27 '16 at 18:33

InfiniteFlash

This does not work here, since my imputed data set contains decimal values. – Varun Sep 27 '16 at 19:30
I'm not sure why that would matter, but look at other alternatives. – InfiniteFlash Sep 27 '16 at 19:52
I say this because I tried sqldf on another data set which does not have decimal values and for that it worked correctly. – Varun Sep 27 '16 at 20:21
