
I have two data frames: raw2, which has 28,406 records, and raw3, which has 26,421 records.

The records in raw3 are a subset of those in raw2. In fact raw3 was derived using:

raw3<-setDT(raw2)[order(O_ID, Program_forsorting), head(.SD, 1), .(O_ID)]

I'm now trying to use setdiff to pull the records that did not get carried over from raw2 to raw3:

setdiff(raw2,raw3)

The result should have 1,985 records. However, it has 28,406, which is all of raw2. If I switch the arguments around to setdiff(raw3,raw2), the result contains 26,421 records, which is all of raw3.

What am I doing wrong?

Here is sample data:

raw2<-as.data.frame(cbind("col1"=c("a","h","b","f","g"),"O_ID"=c(1,1,1,4,5), "Program_forsorting"=c("p1","p2","p2","p3","p1")))
Danny
  • Are there multiple columns? Are you trying to check that all values in all columns match? When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. You don't have to share your actual data, but unless we can reproduce what's going on, it's not easy to help. – MrFlick Apr 06 '18 at 19:18
  • I didn't provide sample data initially since I thought it was associated with the size of the dataset. I've provided sample data above. – Danny Apr 06 '18 at 19:30

2 Answers


I do not believe setdiff works on data.table objects directly, since it expects vectors as input; you would have to write a function and apply it over all columns. I'd try the native data.table function fsetdiff instead. Make sure both objects are data.table objects.

fsetdiff(raw2,raw3)
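
As a minimal sketch using the sample data from the question (rebuilt here as a data.table with a numeric O_ID, which is an assumption for illustration), the call can be checked end to end. One detail to watch: grouping with by moves O_ID to the front of raw3, and fsetdiff expects both tables to have the same columns in the same order, so the sketch reorders raw3 first. The name dropped is hypothetical.

```r
library(data.table)

# Sample data from the question, built directly as a data.table
raw2 <- data.table(
  col1 = c("a", "h", "b", "f", "g"),
  O_ID = c(1, 1, 1, 4, 5),
  Program_forsorting = c("p1", "p2", "p2", "p3", "p1")
)

# Keep the first row per O_ID after sorting, as in the question
raw3 <- raw2[order(O_ID, Program_forsorting), head(.SD, 1), by = .(O_ID)]

# Grouping puts O_ID first; align raw3's column order with raw2
setcolorder(raw3, names(raw2))

# Rows of raw2 that were not carried over to raw3
dropped <- fsetdiff(raw2, raw3)
print(dropped)
```

With this sample, the two rows for O_ID 1 that were not picked (col1 values "h" and "b") are the ones returned.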
D.sen

Someone posted a response and then deleted it. However, it works, so I'll share it here.

Because the comparison spans multiple columns, I should have used data.table's fsetdiff function:

fsetdiff(raw2,raw3)
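
A possible alternative, sketched here under the assumption that both objects are data.tables with the same column names and types, is an anti-join on every column. Unlike fsetdiff with its default settings, which returns only unique rows, the anti-join keeps duplicate rows of raw2, which may matter if the expected count of 1,985 includes duplicates. It also matches columns by name, so column order does not need to be aligned first. The name not_carried is hypothetical.

```r
library(data.table)

# Sample data from the question, built directly as a data.table
raw2 <- data.table(
  col1 = c("a", "h", "b", "f", "g"),
  O_ID = c(1, 1, 1, 4, 5),
  Program_forsorting = c("p1", "p2", "p2", "p3", "p1")
)

# Keep the first row per O_ID after sorting, as in the question
raw3 <- raw2[order(O_ID, Program_forsorting), head(.SD, 1), by = .(O_ID)]

# Anti-join: rows of raw2 with no exact match in raw3 across all columns
not_carried <- raw2[!raw3, on = names(raw2)]
print(not_carried)
```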
Danny