
I used R a lot in college, but I don't use it anymore and I admit I'm a little rusty. I know that for loops are not great for larger datasets, and I know there's a much more efficient way to do what I'm about to do.

Please excuse me if I use terms incorrectly, and correct me if I'm wrong.

My data is a CSV file with 3 columns. The first column is an ID field, and the second and third columns are different plans.

The ID field has a lot of duplicates, and I only want to use rows whose IDs are not duplicated. So if my IDs are 1, 3, 3, 5, 6, I only want to use 1, 5, and 6.
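For example, here's that rule on the toy vector (a quick sketch using base R's `table()`, the same counting idea as my function below):

x <- c(1, 3, 3, 5, 6)
counts <- table(x)                      # how many times each value appears
as.numeric(names(counts)[counts == 1])  # values that appear exactly once
# [1] 1 5 6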

This is what I've done.

plan.get <- function(plans) {
  # Count how many times each ID appears, then keep only the IDs that appear once
  counts <- as.data.frame.table(table(plans[, 1]))
  unique.counts <- counts[counts[, 2] == 1, ]
  unique.plans <- matrix(0, nrow(unique.counts), 3)

  # Look up each singleton ID's row in the original data, one at a time
  for (i in 1:nrow(unique.plans)) {
    unique.plans[i, 1] <- plans[which(plans[, 1] == unique.counts[i, 1]), 1]
    unique.plans[i, 2] <- as.character(plans[which(plans[, 1] == unique.counts[i, 1]), 2])
    unique.plans[i, 3] <- as.character(plans[which(plans[, 1] == unique.counts[i, 1]), 3])
  }

  return(unique.plans)
}

I'm thinking of things like vectorized functions or the apply family, but reading the documentation is just so far beyond me at this stage. Any help would be greatly appreciated!

Example of Data - 'plans'

1    Plan 1    Plan 2
3    Plan 2    Plan 2
3    Plan 2    Plan 2
5    Plan 3    Plan 1
6    Plan 2    Plan 3
7    Plan 1    Plan 2
7    Plan 3    Plan 1
8    Plan 2    Plan 3

And this is what I want my final output to be (because 3 and 7 are duplicated):

1    Plan 1    Plan 2
5    Plan 3    Plan 1
6    Plan 2    Plan 3
8    Plan 2    Plan 3

To be clear, the function I've written does exactly what I want it to do, but because of the for loop it is incredibly slow.

hordan
  • It would be helpful if you included a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input data. This will make it easier to give you a more specific answer and test any possible solutions. – MrFlick Aug 03 '16 at 14:05
  • try `noDUP_ID = DF$ID[!duplicated(DF$ID)]` , this will output only non-duplicated IDs – Silence Dogood Aug 03 '16 at 14:07
  • @Osssan that will not do what the OP asked for. That will keep one of the repeated observations rather than dropping them all. See `duplicated(c(3,3,3))` – MrFlick Aug 03 '16 at 14:16
  • You are right, I misread. `setdiff(unique(DF$ID), DF$ID[duplicated(DF$ID)])` , this should exclude the duplicated values (see the sketch after these comments) – Silence Dogood Aug 03 '16 at 14:32
  • Here's another maybe-helpful link: http://stackoverflow.com/q/21946201/ – Frank Aug 03 '16 at 14:51
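A minimal sketch of that `setdiff()` idea from the comments, assuming a data frame `DF` whose ID column is named `ID`:

DF <- data.frame(ID = c(1, 3, 3, 5, 6))
# duplicated() flags repeat occurrences; removing those values from the
# unique ID set leaves only the IDs that appear exactly once
keep <- setdiff(unique(DF$ID), DF$ID[duplicated(DF$ID)])
DF[DF$ID %in% keep, , drop = FALSE]
#   ID
# 1  1
# 4  5
# 5  6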

2 Answers


You can simply find which IDs are duplicated and drop every row carrying one of those IDs:

# duplicated(df$ID) flags second and later occurrences; indexing df$ID with
# it gives the repeated ID values themselves (not row positions), so %in%
# drops every row whose ID is repeated
df[!df$ID %in% df$ID[duplicated(df$ID)], ]
#  ID   var1   var2
#1  1 Plan 1 Plan 2
#4  5 Plan 3 Plan 1
#5  6 Plan 2 Plan 3
#8  8 Plan 2 Plan 3

DATA

structure(list(ID = c(1L, 3L, 3L, 5L, 6L, 7L, 7L, 8L), var1 = c("Plan 1", 
"Plan 2", "Plan 2", "Plan 3", "Plan 2", "Plan 1", "Plan 3", "Plan 2"
), var2 = c("Plan 2", "Plan 2", "Plan 2", "Plan 1", "Plan 3", 
"Plan 2", "Plan 1", "Plan 3")), .Names = c("ID", "var1", "var2"
), row.names = c(NA, -8L), class = "data.frame")
Sotos
  • Also `df[ with(df, ave(ID, ID, FUN = length) == 1L), ]` and `df[with(df, !(duplicated(ID) | duplicated(ID, fromLast=TRUE))), ]` – Frank Aug 03 '16 at 14:48
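For completeness, here is the first of Frank's variants expanded with comments (just a sketch, using the same `df` as in this answer):

# ave(ID, ID, FUN = length) returns, for every row, the size of that row's
# ID group; keeping rows whose group size is 1 drops repeated IDs entirely
df[with(df, ave(ID, ID, FUN = length) == 1L), ]
#  ID   var1   var2
#1  1 Plan 1 Plan 2
#4  5 Plan 3 Plan 1
#5  6 Plan 2 Plan 3
#8  8 Plan 2 Plan 3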

The above works, and I figured out something else as well:

unique.plans = plans[!is.na(match(plans[,1], unique.counts[,1])),]

This replaces my for loop inside `plan.get` (it still relies on the `unique.counts` computed there). Maybe not the most elegant, but it did the job for me.
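For context, here's the whole function with that line in place of the loop (just a sketch of the same logic):

plan.get <- function(plans) {
  counts <- as.data.frame.table(table(plans[, 1]))
  unique.counts <- counts[counts[, 2] == 1, ]
  # match() returns NA for IDs outside the singleton set,
  # so !is.na() keeps exactly the rows with non-duplicated IDs
  plans[!is.na(match(plans[, 1], unique.counts[, 1])), ]
}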

Thanks for the help everyone!

hordan