3

I am trying to join two data frames which in SQL would utilise a where and a between statement for dates.

In SQL, the code would be:

select Date,(Value1-Test1) as Ans1,(Value2-Test2) as Ans2,ID
from Data a
 inner join Test b on a.ID=b.ID and a.Date between b.DateStart and c.DateEnd 

This is Data

Date                Value1  Value2  ID
01/01/16 19:30:00       10      30  A
01/01/16 19:50:20       20      40  B
01/01/16 19:55:30       30      50  C

This is Test

RowNumber   DateStart            DateEnd    Test1   Test2   ID
1   01/01/16 17:00:00   01/01/16 22:00:05   2          4    A
2   01/01/16 22:00:06   01/01/16 01:50:00   3          6    A
3   01/01/16 17:00:00   01/01/16 22:00:05   4          8    B
4   01/01/16 22:00:06   01/01/16 01:50:00   5          2    B
5   01/01/16 17:00:00   01/01/16 22:00:05   6          4    C
6   01/01/16 22:00:06   01/01/16 01:50:00   7          5    C

The results I am trying to create

Date                  Ans1    Ans2 ID
01/01/16 19:30:00        8      26  A
01/01/16 19:50:12       16      32  B
01/01/16 19:55:24       24      46  C

Any help and pointers would be great.

zx8754
  • 52,746
  • 12
  • 114
  • 209
  • 1
    Maybe use `sqldf` package? Also, see [How to join (merge) data frames (inner, outer, left, right)?](http://stackoverflow.com/questions/1299871) – zx8754 Feb 07 '16 at 19:01
  • 1
    http://stackoverflow.com/questions/25815032/finding-overlaps-between-interval-sets-efficient-overlap-joins – zx8754 Feb 07 '16 at 19:32
  • what is actually the difference between Data and the results beside the different labels for two of your columns? – MLavoie Feb 07 '16 at 20:07
  • @MLavoie Apologies - I have edited my results set. – Jake Williams Feb 07 '16 at 20:58
  • @zx8754 That post looks really interesting - how would I extend that concept to ensure matching on ID - I have an overlap and a filter type join. I had thought about sqldf but I am trying to avoid using it (feels lazy ;-}) and I'm also conscious that this will be on a large dataset so want to ensure performance – Jake Williams Feb 07 '16 at 21:03
  • I have tried the foverlap function but now get an error message – Jake Williams Feb 07 '16 at 21:37
  • 1
    @zx8754 You are a complete star !!! Thanks so much - it is now working and I have edited the question to show your link and the solution I implemented !!! Brilliant ! – Jake Williams Feb 07 '16 at 22:31

2 Answers2

3

Following advice from @zx8754 I have tried to use data.table::foverlaps()

In Data, rename the Date field to DateStart and create a second date field where DateEnd=Date. Add the following code:

setkey(Data,ID,DateStart,DateEnd) 
setkey(Test,Id,DateStart,DateEnd) 
CompleteDataset <- foverlaps(Data, Test, type="any")

This give me exactly what I want.

Finding Overlaps between interval sets / Efficient Overlap Joins

Community
  • 1
  • 1
  • 1
    These docs are from 1.9.4. If you want 1.9.7 docs you can see this unofficial mirror: https://jangorecki.gitlab.io/data.table/html/foverlaps.html – jangorecki Feb 07 '16 at 22:56
  • I have now tried to port this code CompleteDataset <- foverlaps(Data, Test, type="within", which="FALSE") to Azure ML and receive the following error.....missing value where TRUE/FALSE needed....as far as I can see, it is only expecting one TRUE/FALSE and I have specified that ??? – Jake Williams Feb 08 '16 at 11:00
  • 1
    @JakeWilliams, try `which = FALSE` (without quotes.. `which` argument requires logical values.. "FALSE" is character type.) – Arun Feb 08 '16 at 12:11
  • @Arun I noticed that and changed that but still the same error message ... – Jake Williams Feb 08 '16 at 12:13
  • @JakeWilliams better to post a new question on that, providing data.table version in your Azure ML environment. Also write if you operate on the **same** data in Azure ML you have tested locally or is it new/bigger dataset. – jangorecki Feb 08 '16 at 12:45
0

Simply merge the two datasets on ID, then conditionally filter rows afterwards which corresponds to SQL's JOIN and WHERE clauses. Finally, run calculations and select columns afterwards.

mergedf <- merge(data, test, by="ID")

mergedf <- mergedf[(mergedf$Date >= mergedf$DateStart & 
                    mergedf$Date <= mergedf$DateEnd),]

mergedf$Ans1 <- mergedf$Value1 - mergedf$Test1
mergedf$Ans2 <- mergedf$Value2 - mergedf$Test2

mergedf <- mergedf[c('Date', 'Ans1', 'Ans2', 'ID')]
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • When trying to do the first merge I get an error message telling me I have more than nrow(data)+nrow(test) due to the duplicate values (which we know) and won't allow me to explode it – Jake Williams Feb 07 '16 at 21:53
  • Apologies, R is case sensitive, so `ID`, the merge key, should be capitalized per your posted example. – Parfait Feb 07 '16 at 23:40