Suppose I have two distinct data sets Data1 and Data2. For each entry in Data1$Incidents I want to find the rows in Data2$Incidents which match it, and also keep track of entries which have no matches. I subsequently save the entries which match into a new data frame Data1_Matches. Now for each entry in Data2$Incidents I look for the entries in Data1_Matches$Incidents which match, and then create an analogous data frame Data2_Matches.

Suppose for the sake of argument my data sets look like the following:

Day    Incidents
"Monday"    30
"Friday"    11
"Sunday"    27

My algorithm at the moment looks like the following:

Data1_Incs = as.integer(Data1$Incidents)
LEN1     = length(Data1_Incs)
No_Match = integer(0)   # indices of Data1 rows with no match in Data2

for (k in seq_len(LEN1)){
  Incs = which(Data2$Incidents == Data1_Incs[k])
  if (length(Incs) == 0){
    No_Match = c(No_Match,k)
  }
}

# Data1[-No_Match,] returns zero rows when No_Match is empty,
# so build the matched subset from a logical mask instead
Data1_Match    <- Data1[!(seq_len(LEN1) %in% No_Match),]
Data1_No_Match <- Data1[No_Match,]

Data2_Incs = Data2$Incidents
LEN2       = length(Data2_Incs)
Un_Match   = integer(0)   # indices of Data2 rows with no match in Data1_Match

for (j in seq_len(LEN2)){
  Incs = which(as.integer(Data1_Match$Incidents) == Data2_Incs[j])
  if (length(Incs) == 0){
    Un_Match = c(Un_Match, j)
  }
}

# Same empty-index safeguard as above
Data2_Match    <- Data2[!(seq_len(LEN2) %in% Un_Match),]
Data2_No_Match <- Data2[Un_Match,]

What is a better way for me to accomplish this task, without using a for loop? For reference Data1 has about 15,000 entries while Data2 has closer to two million.

Mnifldz
  • If you provide a small, self-contained sample of each data set, it will be much easier to help. – Señor O Aug 04 '15 at 23:03
  • Matches in what sense? The whole row is identical? Or are you using a column or set of columns as a key? – ulfelder Aug 04 '15 at 23:15
  • @ulfelder The sample data set I gave isn't amazing, but I mean matching in this sense: For every row in `Data1` I am looking to see if there exists an entry in `Data2$Incidents` that matches that individual entry from `Data1`. If there is no match, I save the index so that I can divide `Data1` into two subsequent data sets: one with matches and one without matches. – Mnifldz Aug 04 '15 at 23:25
  • @SeñorO I added a tiny sample of what the code "might" look like. Unfortunately I cannot share the actual contents. Does this example seem helpful at all? – Mnifldz Aug 04 '15 at 23:27
  • @Mnifldz that's not really helpful. Read this: http://stackoverflow.com/q/5963269/1717913 – Señor O Aug 04 '15 at 23:30
  • You could also use SQL functions to partition the dataset using the `sqldf` package – Michal Aug 04 '15 at 23:33

1 Answer

Try to use setdiff.

I will demonstrate on the first for loop:

No_Match <- setdiff(unique(Data2$Incidents), unique(Data1$Incidents))

Not sure if this quite satisfies your requirement.
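Two details of `setdiff` are worth checking before relying on it here: the argument order decides which side's unmatched values come back, and the result is a vector of unique values rather than row indices. The sketch below is only an illustration on made-up toy data (the values and the name `No_Match_Values` are assumptions, not part of the question), and it assumes both `Incidents` columns have the same type. It puts `Data1` first so that the result mirrors the first for loop, then subsets with `%in%`:

```r
# Toy data (hypothetical values, for illustration only)
Data1 <- data.frame(Day = c("Monday", "Friday", "Sunday"),
                    Incidents = c(30L, 11L, 27L))
Data2 <- data.frame(Day = c("Tuesday", "Monday", "Saturday", "Sunday"),
                    Incidents = c(30L, 30L, 5L, 27L))

# Unique values that occur in Data1 but not in Data2.
# Swapping the arguments would return Data2's unmatched values instead.
No_Match_Values <- setdiff(Data1$Incidents, Data2$Incidents)

# setdiff() returns values, not row indices, so subset with a logical mask
is_unmatched   <- Data1$Incidents %in% No_Match_Values
Data1_No_Match <- Data1[is_unmatched, ]
Data1_Match    <- Data1[!is_unmatched, ]
```

Because the subsetting uses a logical mask, it also behaves sensibly when nothing is unmatched: the mask is all `FALSE`, so `Data1_Match` keeps every row.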

Michal
  • Conceivably I wouldn't need a for loop using `setdiff` since it would give me the (unique) entries in `Data2$Incidents` not contained in `Data1$Incidents`, correct? If so, how would I subset `Data1` to contain only those incidents not contained in `No_Match`? – Mnifldz Aug 04 '15 at 23:40
  • This would give you the actual values and therefore eliminate the extra step of getting the subsets and then subsetting the dataset. – Michal Aug 04 '15 at 23:41
  • Thank you! `setdiff` was what I needed. I was able to completely remove the for loops and my script runs pretty quickly now. – Mnifldz Aug 04 '15 at 23:56
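
For completeness, here is one way the whole two-step split from the question might look without loops. This is a sketch rather than the code the asker ended up with: `split_by_match` is a hypothetical helper, the toy data is invented, and both `Incidents` columns are assumed to share a type. It relies on `%in%`, which checks every row against the other column in a single vectorized call.

```r
# Hypothetical helper: split df by whether its key column appears in `keys`
split_by_match <- function(df, keys, col = "Incidents") {
  found <- df[[col]] %in% keys
  list(match    = df[found, , drop = FALSE],
       no_match = df[!found, , drop = FALSE])
}

# Toy data (hypothetical values, for illustration only)
Data1 <- data.frame(Day = c("Monday", "Friday", "Sunday"),
                    Incidents = c(30L, 11L, 27L))
Data2 <- data.frame(Day = c("Tuesday", "Monday", "Saturday", "Sunday"),
                    Incidents = c(30L, 30L, 5L, 27L))

# Step 1: Data1 rows with / without a match in Data2
res1 <- split_by_match(Data1, Data2$Incidents)

# Step 2: Data2 rows with / without a match in the matched part of Data1
res2 <- split_by_match(Data2, res1$match$Incidents)
```

Note that every value in `res1$match$Incidents` already occurs in `Data2`, so step 2 selects the same rows as matching `Data2` against `Data1$Incidents` directly.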