2

I have two tables to join based on date-time fields. To recreate the scenario let us take the commercial vs sale example. We want to know which sale is related to which commercial.

Sale can only be tagged to the last commercial and only if it occurred after the commercial.

Also, if a sale happened after multiple commercials, we can only tag the sale to the last commercial; prior commercials will have a null in the join.

I am unable to get this last part. If there is a sale after multiple commercials, then all such commercials are joined with that sale; which I don't want. In my example, the sale that happened at "2017-01-01 02:05:00" should join with the commercial that aired at "2017-01-01 02:00:00" and not the prior commercials.

output of the code

library(lubridate)
library(data.table)

ts <- seq(as.POSIXct("2017-01-01", tz = "UTC"),
          as.POSIXct("2017-01-02", tz = "UTC"),
          by = "30 min")

commercial <-
  data.table(
    c_row_number = 1:10,
    c_time       = ts[1:10],
    c_time_roll  = ts[1:10]
  )

sale <-
  data.table(
    s_row_number = 1:4,
    s_time       = ts[5:8] + minutes(5),
    s_time_roll  = ts[5:8] + minutes(5)
  )

setkey(commercial, c_time_roll)
setkey(sale, s_time_roll)

tbl_joined <- sale[commercial, roll = -Inf] # , mult = 'last']

Any idea how can we get NAs where c_row_number is 1, 2, 3, and 4. Thank you.

Imtiaz
  • 69
  • 1
  • 7

2 Answers2

1

There's no way to do this directly -- x[i] uses i to look up rows in x. mult is used for the reverse -- when multiple rows in x match to a single row in i. Here, multiple rows in i match to a single row in x.

Your best bet then, is to operate post-join on the resulting table. For example, to drop those rows, you could use unique:

unique(sale[commercial, roll = -Inf], by = 's_row_number', fromLast = TRUE)
#    s_row_number              s_time         s_time_roll c_row_number
# 1:            1 2017-01-01 02:05:00 2017-01-01 02:00:00            5
# 2:            2 2017-01-01 02:35:00 2017-01-01 02:30:00            6
# 3:            3 2017-01-01 03:05:00 2017-01-01 03:00:00            7
# 4:            4 2017-01-01 03:35:00 2017-01-01 03:30:00            8
# 5:           NA                <NA> 2017-01-01 04:30:00           10
#                 c_time
# 1: 2017-01-01 02:00:00
# 2: 2017-01-01 02:30:00
# 3: 2017-01-01 03:00:00
# 4: 2017-01-01 03:30:00
# 5: 2017-01-01 04:30:00

I suspect you're creating {s,c}_row_number just for this task; to do so without those columns, you could do:

sale[commercial, roll = -Inf][order(-c_time)][rowid(s_time) == 1L]

We sort in reverse by c_time to make sure rowid gets the most recent value.

Note that in both cases, one of the is.na(s_time) rows has been dropped.

Hopefully this gets you going in the right direction.

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
  • Thank you Michael for your time and especially to elaborate on use of 'mult'. Actually, I don't want to drop rows; I want to keep them with nulls in the joined columns as if there was no records in the other table to join. – Imtiaz Jul 07 '19 at 08:24
1

If your commercial times are sorted, or you can sort them, then you can use a non-equi join with a helper column with shifted times:

library(lubridate)
library(data.table)

ts <- seq(as.POSIXct("2017-01-01", tz = "UTC"),
          as.POSIXct("2017-01-02", tz = "UTC"),
          by = "30 min")

commercial <-
  data.table(
    c_row_number = 1:10,
    c_time       = ts[1:10],
    c_next_time  = shift(ts[1:10], type = "lead", fill = max(ts))
  )

sale <-
  data.table(
    s_row_number = 1:4,
    s_time       = ts[5:8] + minutes(5),
    s_time_join  = ts[5:8] + minutes(5)
  )

tbl_joined <- sale[commercial, on = .(s_time_join >= c_time, s_time_join < c_next_time)]

And if you want to use this idiom:

commercial[, s_time := sale[.SD,
                            .(s_time),
                            on = .(s_time_join >= c_time, s_time_join < c_next_time)]]
print(commercial)
    c_row_number              c_time         c_next_time              s_time
 1:            1 2017-01-01 00:00:00 2017-01-01 00:30:00                <NA>
 2:            2 2017-01-01 00:30:00 2017-01-01 01:00:00                <NA>
 3:            3 2017-01-01 01:00:00 2017-01-01 01:30:00                <NA>
 4:            4 2017-01-01 01:30:00 2017-01-01 02:00:00                <NA>
 5:            5 2017-01-01 02:00:00 2017-01-01 02:30:00 2017-01-01 02:05:00
 6:            6 2017-01-01 02:30:00 2017-01-01 03:00:00 2017-01-01 02:35:00
 7:            7 2017-01-01 03:00:00 2017-01-01 03:30:00 2017-01-01 03:05:00
 8:            8 2017-01-01 03:30:00 2017-01-01 04:00:00 2017-01-01 03:35:00
 9:            9 2017-01-01 04:00:00 2017-01-01 04:30:00                <NA>
10:           10 2017-01-01 04:30:00 2017-01-02 00:00:00                <NA>
Alexis
  • 4,950
  • 1
  • 18
  • 37
  • Thank you Alexis for your time. I was initially doing non-equality join; then thought to use rolling join. The issue I face with the above suggestion is that here the same commercial can be joined with multiple sales. For example when we have two sales between two commercials; both will join to the same commercial. This was my second condition mentioned in the question. Can your suggestion be expanded to take care of it? – Imtiaz Jul 07 '19 at 08:21
  • Can the inequality join logic be added to rolling join? I will also try that. – Imtiaz Jul 07 '19 at 08:28
  • Did not work: it says "roll is not implemented for non-equi joins yet." – Imtiaz Jul 07 '19 at 08:42
  • 1
    Adding a mult condition to non-equi join seems to work: tbl_joined <- sale[commercial, on = .(s_time_join >= c_time_join, s_time_join < c_next_time), mult = "first"] Does this seem correct? If so, please modify your answer for future reference. In any case, I will accept this answer. – Imtiaz Jul 07 '19 at 09:03
  • @Imtiaz I believe that using `mult` should work for that, but I didn't understand that requirement from the original question. I thought you said that 1 sale shouldn't match to many commercials, but I thought many sales matching to 1 commercial was allowed. – Alexis Jul 07 '19 at 09:24
  • Sorry, my question was unclear. I wanted 1 commercial to map to 1 sale only; the first one that occur after the commercial; I was using roll for the same. And also, 1 sale to map to 1 commercial only; which was the part I was not getting. Anyhow, both the requirements are now being met by slightly modifying your answer. Thank you much. – Imtiaz Jul 07 '19 at 09:39