0

What's the most reliable way to remove matching Ids from two large data frames in R?

For example, I have a list of participants who do not want to be contacted (n=200). I would like to remove them from my dataset of over 100 variables and 200,000 observations.

This is the list of 200 participant ids that I need to remove from the dataset.

exclude=read.csv("/home/Project/file/excludeids.csv", header=TRUE, sep=",") 
dataset.exclusion<- dataset[-which(exclude$ParticipantId %in% dataset$ParticipantId  ), ]  

Is this the correct command to use?

I don't think this command is doing what I want, because when I verify the result with length(which(dataset.exclusion$ParticipantId %in% exclude$ParticipantId)), I don't get 0.

Any insight?

Frank
Tan
  • It is much easier to help you if you provide a [minimal, reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). – Henrik Oct 07 '13 at 16:47
  • The dataset is quite massive. Every participant has a unique 7-digit Id with variables on DOB, gender, and an assortment of questionnaire data. I have a specific list of participant ids who do not want to be contacted further (this isn't a variable in the original dataset), and I therefore want to remove these ~200 rows from the data frame. – Tan Oct 07 '13 at 16:54
  • As described in the link, you don't need to post the whole data in a _minimal_ reproducible example. Have a look at the dummy data in @Codoremifa's answer: a few rows that easily fit in an R console, and only the relevant columns. Not more, not less. – Henrik Oct 07 '13 at 16:58
  • Why would you expect to get 0? That would mean there are no common ids. – Señor O Oct 07 '13 at 17:38

2 Answers

2

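Your which() call returns positions within exclude, not within dataset, so it drops the wrong rows. Test each row of dataset for membership in the exclusion list instead, for example:

dataset.exclusion <- dataset[!dataset$ParticipantId %in%
            unique(exclude$ParticipantId), ]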
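
To see the difference between the two approaches, here is a minimal sketch on toy data. The ids and the gender column are made up for illustration; only the names dataset, exclude and ParticipantId come from the question.

```r
# Toy data standing in for the real tables (hypothetical values).
dataset <- data.frame(
  ParticipantId = c(1000001, 1000002, 1000003, 1000004, 1000005),
  gender = c("F", "M", "F", "M", "F")
)
exclude <- data.frame(ParticipantId = c(1000004, 1000002))

# The command from the question: which() returns positions within `exclude`
# (here 1 and 2), so rows 1 and 2 of `dataset` are dropped, whatever their id.
wrong <- dataset[-which(exclude$ParticipantId %in% dataset$ParticipantId), ]

# Membership test evaluated per row of `dataset`, then negated.
keep <- !(dataset$ParticipantId %in% exclude$ParticipantId)
dataset.exclusion <- dataset[keep, ]

# Verification: count excluded ids that are still present.
n_wrong <- sum(wrong$ParticipantId %in% exclude$ParticipantId)
n_left  <- sum(dataset.exclusion$ParticipantId %in% exclude$ParticipantId)
```

Indexing with a logical vector also avoids a known pitfall of -which(): if nothing matches, which() returns an empty vector, and negative indexing with it returns zero rows rather than the full data frame.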
agstudy
1

Something like this?

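library(data.table)

# Toy data: columns a and b together identify a row
dataset <- data.table(
  a = c(1,2,3,4,5,6),
  b = c(11,12,13,14,15,16),
  d = c(21,22,23,24,25,26)
)

# Key the table on the columns used for matching
setkeyv(dataset, c('a','b'))

# Rows to remove, identified by the same columns
ToExclude <- data.table(
  a = c(1,2,3),
  b = c(11,12,13)
)

# Not-join: keep the rows of dataset that have no match in ToExclude
dataset[!ToExclude]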

#    a  b  d
# 1: 4 14 24
# 2: 5 15 25
# 3: 6 16 26
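
For the single-id case in the question, the same not-join can be written without setting a key by naming the column in on=. This is a sketch that assumes a data.table version supporting the on= argument (1.9.6 or later); the ids and the score column are hypothetical.

```r
library(data.table)

# Toy stand-ins for the real tables (hypothetical values).
dataset <- data.table(
  ParticipantId = c(1000001, 1000002, 1000003, 1000004),
  score = c(10, 20, 30, 40)
)
ToExclude <- data.table(ParticipantId = c(1000002, 1000004))

# Not-join on the id column: keep rows of `dataset` with no match in `ToExclude`.
remaining <- dataset[!ToExclude, on = "ParticipantId"]
```

With a real exclusion file, ToExclude could be read with fread() or converted from an existing data frame with as.data.table().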
TheComeOnMan