1

The data set I am using has missing values, so I used the Amelia package for imputation. The resulting data set has the following form:

Bi.Rads   Age       Shape      Margin      Density   Severity
5.000000  70.00000  3.4685058  5.00000000  3.000000  1
5.000000  70.00000  4.0000000  3.00000000  3.000000  1
5.000000  70.00000  4.0000000  4.00000000  3.000000  1
5.000000  70.00000  4.0000000  5.00000000  3.000000  1
5.000000  70.00000  4.2881664  4.00000000  3.689292  1
5.000000  70.27765  4.0000000  4.00000000  3.000000  1

The non-integer values are the imputed ones. Treating this data set as a data frame df, I randomly sample 100 rows from df without replacement:

df1<-df[sample(nrow(df),100),]

Now I want to remove the rows of df1 from df, which should leave 861 rows. I have tried the suggestions from similar posts, such as %in% and the dplyr package, but none of them returns 861 rows. The sqldf and compare packages have not worked either. I tried to comment on the other posts but couldn't because I don't have enough reputation. Could you please help me out?

Varun
  • Consider saving a vector of the row indices you want for df1, then creating df2 from every row not in that vector: `keep <- sample(nrow(df), 100)`, then `df1 <- df[keep, ]` and `df2 <- df[-keep, ]`. – mdgbeck Sep 27 '16 at 18:29
  • Oh this is absolutely a duplicate question. – InfiniteFlash Sep 27 '16 at 18:33
  • @AOK3000 I am using RStudio and I tried your suggestion. The environment window says 861 observations, but when I print the data frame it seems to print all 961 of them. Not sure if it's right. – Varun Sep 27 '16 at 18:41
  • When you are printing you are seeing the original row names. The environment is accurate. You can double check by calling `dim(df2)`. – mdgbeck Sep 27 '16 at 21:39
  • @AOK3000 could you answer another question? I am calculating this confusion matrix: `table(pred = svm.pred, true = testset[,10])`. I want to repeat the process of taking a random 10% sample of the records and computing the confusion matrix 20 times. How can I take the mean of all the confusion matrices I get from this table call? I want to find the average accuracy of the SVM. – Varun Sep 27 '16 at 22:04
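
One way the repeated hold-out from the last comment could be sketched is shown below. This is only an illustration under stated assumptions: the `toy` data is synthetic, `one_run` is a hypothetical helper, and `glm()` is used as a stand-in classifier so the example runs in base R (the `svm()` fit and `testset[,10]` from the comment would take its place). The idea is to build each table from factors with fixed levels so that all tables have the same shape, then average them element-wise.

```r
set.seed(1)

# Synthetic toy data standing in for the real data set (illustration only)
n <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- factor(ifelse(toy$x + rnorm(n, sd = 0.5) > 0, 1, 0), levels = c(0, 1))

# Hypothetical helper: one random 10% hold-out, returning its confusion matrix.
# glm() is only a stand-in for the svm() model mentioned in the comment.
one_run <- function(data) {
  test_idx <- sample(nrow(data), round(0.1 * nrow(data)))
  trainset <- data[-test_idx, ]
  testset  <- data[test_idx, ]
  fit  <- glm(y ~ x, data = trainset, family = binomial)
  prob <- predict(fit, newdata = testset, type = "response")
  pred <- factor(ifelse(prob > 0.5, 1, 0), levels = levels(data$y))
  # Fixed factor levels keep every table 2 x 2, so the tables can be added
  table(pred = pred, true = testset$y)
}

runs <- replicate(20, one_run(toy), simplify = FALSE)

mean_cm <- Reduce(`+`, runs) / length(runs)                  # element-wise mean table
accuracy <- sapply(runs, function(cm) sum(diag(cm)) / sum(cm))
mean_accuracy <- mean(accuracy)                              # average accuracy over 20 runs
```

Averaging the per-run accuracies and reading the accuracy off `mean_cm` give the same number here, because every run uses a test set of the same size.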

2 Answers

1

Try this:

indices <- sample(1:nrow(df), 100)  # 100 distinct row positions, sampled without replacement
df <- df[-indices,]                 # drop those rows by position
Sandipan Dey
  • I tried your suggestion and got an error. Running `random<-mammo[sample(1:nrow(mammo),100),]` and then `mammo<-mammo[-random,]` gives `Error in xj[i] : invalid subscript type 'list'` – Varun Sep 27 '16 at 18:35
  • random<-sample(1:nrow(mammo),100); mammo<-mammo[-random,] – Sandipan Dey Sep 27 '16 at 18:36
  • yeah did this too, and tried to print mammo, it prints all 961 of them, but shows the number of observations as 861. – Varun Sep 27 '16 at 18:46
  • thanks, even though it prints all 961 records only 861 are considered. – Varun Sep 27 '16 at 23:22
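
To tie the comments above together, here is a minimal sketch on synthetic data (the `id` and `value` columns are made up; only the 961 and 100 row counts come from the thread). The error in the first comment arises because `random` held a data frame of sampled rows rather than a vector of row positions, and a data frame cannot be used as a negative subscript. The "prints all 961" impression most likely comes from the row names, which keep their original labels after subsetting.

```r
set.seed(42)

# Synthetic stand-in for the 961-row data frame (illustration only)
df <- data.frame(id = 1:961, value = rnorm(961))

keep <- sample(nrow(df), 100)   # integer row positions, not rows
df1  <- df[keep, ]              # the 100 sampled rows
df2  <- df[-keep, ]             # everything else

dim(df1)                # 100 rows
dim(df2)                # 861 rows
tail(rownames(df2))     # original labels survive, so labels near 961 can still show up

rownames(df2) <- NULL   # optional: renumber the rows 1..861
```

Checking `dim(df2)` or `nrow(df2)` is more reliable than reading the row labels off the printed output.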
-2

Here you go. The following works like a negated %in%, i.e. `!(x %in% y)`, but applied to whole rows: anti_join keeps the rows of df that have no match in df1.

library(dplyr)

Desired_data<-anti_join(df, df1)

Source:

Find complement of a data frame (anti - join)
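
A caveat worth checking, since the question reports that dplyr did not return 861 rows: anti_join matches on values, so every row of df that equals some sampled row is dropped, including duplicates that were never sampled. The sketch below mimics that value-based matching in base R (pasted row keys are a stand-in for the join, not dplyr itself) on synthetic data with many repeated rows, and compares it with removal by position.

```r
set.seed(7)

# Synthetic data with many repeated rows (illustration only)
df <- data.frame(a = sample(1:3, 961, replace = TRUE),
                 b = sample(1:3, 961, replace = TRUE))
idx <- sample(nrow(df), 100)
df1 <- df[idx, ]

# Value-based removal: drops every row of df whose values occur anywhere in df1
key  <- do.call(paste, c(df,  sep = "\r"))
key1 <- do.call(paste, c(df1, sep = "\r"))
by_value <- df[!(key %in% key1), ]

# Position-based removal: drops exactly the 100 sampled rows
by_index <- df[-idx, ]

nrow(by_value)   # fewer than 861 whenever a sampled row has an unsampled duplicate
nrow(by_index)   # always 861
```

If the rows of df are all distinct, both approaches agree. With repeated rows, as in the imputed data shown in the question, keeping the sampled indices is the safer route.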

InfiniteFlash