0

I have a table that looks like this;

user_id     timestamp
aa          2018-01-01 12:01 UTC
ab          2018-01-01 05:01 UTC
bb          2018-06-01 09:01 UTC
bc          2018-03-03 23:01 UTC
cc          2018-01-02 11:01 UTC

I have another table that has every week in 2018.

week_id    week_start     week_end
1          2018-01-01     2018-01-07
2          2018-01-08     2018-01-15
3          2018-01-16     2018-01-23
4          2018-01-23     2018-01-30
...        ...            ...

Assume the week_start is a Monday and week_end is a Sunday.

I'd like to do two things. I'd first like to join the week_id to the first table and then I'd like to assign a day to each of the timestamps. My output would look like this:

user_id     timestamp               week_id    day_of_week
aa          2018-01-01 12:01 UTC    1          Monday
ab          2018-01-02 05:01 UTC    1          Tuesday
bb          2018-01-13 09:01 UTC    2          Friday
bc          2018-01-28 23:01 UTC    4          Friday
cc          2018-01-06 11:01 UTC    1          Saturday

In Excel I could easily do this with a vlookup. My main interest is to learn how to join tables in cases like this. For that reason, I won't accept answers that use the weekday function.

Here are both of the tables in a more accessible format.

user_id <- c("aa", "ab", "bb", "bc", "cc")
timestamp <- c("2018-01-01 12:01", "2018-01-01 05:01", "2018-06-01 09:01", "2018-03-03 23:01", "2018-01-02 11:01")

week_id <- seq(1,52)
week_start <- seq(as.Date("2018-01-01"), as.Date("2018-12-31"), 7)
week_end <- week_start + 6

week_start <- week_start[1:52]
week_end <- week_end[1:52]  

table1 <- data.frame(user_id, timestamp)
table2 <- data.frame(week_id, week_start, week_end)
Cauder
  • 2,157
  • 4
  • 30
  • 69
  • 3
    As commented on your last version, please do share the data with `dput`---it makes it copy/pasteable and confirms you are using the appropriate classes. (The timestamp data is particularly annoying to import as you've shared it without delimiters or quotes. `dput(head(your_sample_data))` is much friendlier.) – Gregor Thomas Mar 26 '19 at 02:49
  • I'd also recommend removing the `day_of_week` component from this question. It's entirely separate from your "main interest" in learning how to join these tables. Better to ask it as a separate question. [Or maybe search for it separately](https://stackoverflow.com/q/9216138/903061) before asking. – Gregor Thomas Mar 26 '19 at 02:51
  • `format(Sys.time(), format="%U %a %A")` returns `"12 Mon Monday"`, so it seems that instead of trying to find intervals, you could just convert each date into your `week_id` directly. So, `format(x$timestamp, format="%U")` and `format(x$timestamp, format="%A")` might give you what you need, no joins. – r2evans Mar 26 '19 at 02:54
  • If you've got requirements for what answers you won't accept, if would be easier to help with a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and at least a sense of what you've tried so far – camille Mar 26 '19 at 03:15

2 Answers2

1

Using SQL one can join two tables on a range like this. This seems the most elegant solution expressing our intent directly but we also provide some alternatives further below.

library(sqldf)

DF1$date <- as.Date(DF1$timestamp)

sqldf("select * 
  from DF1 a 
  left join DF2 b on date between week_start and week_end")

giving:

  user_id           timestamp       date week_id week_start   week_end
1      aa 2018-01-01 12:01:00 2018-01-01       1 2018-01-01 2018-01-07
2      ab 2018-01-01 05:01:00 2018-01-01       1 2018-01-01 2018-01-07
3      bb 2018-06-01 09:01:00 2018-06-01      NA       <NA>       <NA>
4      bc 2018-03-03 23:01:00 2018-03-04      NA       <NA>       <NA>
5      cc 2018-01-02 11:01:00 2018-01-02       1 2018-01-01 2018-01-07

dplyr

In a comment the poster asked for whether it could be done in dplyr. It can't be done directly since dplyr does not support complex joins but a workaound would be to do a full cross join of the two data frames which gives rise to an nrow(DF1) * nrow(DF2) intermediate result and then filter this down. dplyr does not directly support cross joins but we can simulate one by doing a full join on an identical dummy constant column that is appended to both data frames in the full join. Since we actually need a right join here to add back the unmatched rows, we do a final right join with the original DF1 data frame. Obviously this is entirely impractical for sufficiently large inputs but for the small input here we can do it. If it were known that there is a match in DF2 to every row in DF1 then the right_join at the end could be omitted.

DF1 %>% 
  mutate(date = as.Date(timestamp), dummy = 1) %>%
  full_join(DF2 %>% mutate(dummy = 1)) %>%
  filter(date >= week_start & date <= week_end) %>%
  select(-dummy) %>%
  right_join(DF1)

R Base

findix finds the index in DF2 corresponding to a date d. We then sapply it over the dates corresponding to rows of DF1 and put DF1 and the corresponding DF2 row together.

findix <- function(d) c(which(d >= DF2$week_start & d <= DF2$week_end), NA)[1]
cbind(DF1, DF2[sapply(as.Date(DF1$timestamp), findix), ])

Note

The input data in reproducible form used is:

Lines1 <- "user_id     timestamp
aa          2018-01-01 12:01 UTC
ab          2018-01-01 05:01 UTC
bb          2018-06-01 09:01 UTC
bc          2018-03-03 23:01 UTC
cc          2018-01-02 11:01 UTC"
DF1 <- read.csv(text = gsub("     +", ",", Lines1), strip.white = TRUE)
DF1$timestamp <- as.POSIXct(DF1$timestamp)

Lines2 <- "week_id    week_start     week_end
1          2018-01-01     2018-01-07
2          2018-01-08     2018-01-15
3          2018-01-16     2018-01-23
4          2018-01-23     2018-01-30"
DF2 <- read.table(text = Lines2, header = TRUE)
DF2$week_start <- as.Date(DF2$week_start)
DF2$week_end <- as.Date(DF2$week_end)
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • This is great, thank you. Is it possible to do this with the joins available in `dplyr`? – Cauder Mar 26 '19 at 03:34
  • I don't think dplyr supports complex joins directly but I have added an example of doing it using a workaround. Note that this may be impractical given that it creates a huge intermediate result. – G. Grothendieck Mar 26 '19 at 11:51
0

This is a case for the fuzzyjoin-package. With the match_fun- argument we can specify conditions for each column. In this case table1$date >= table2$week_start and table1$date <= table2$week_end.

library(fuzzyjoin)
library(lubridate)

table1$date <- as.Date(table1$timestamp)

fuzzy_left_join(table1, table2, 
                by = c("date" = "week_start", "date" = "week_end"),
                match_fun = list(`>=`, `<=`)) %>%
  mutate(day_of_week = wday(date, label = TRUE)) %>%
  select(user_id, timestamp, week_id, day_of_week) 

  user_id        timestamp week_id day_of_week
1      aa 2018-01-01 12:01       1          Mo
2      ab 2018-01-01 05:01       1          Mo
3      bb 2018-06-01 09:01      22          Fr
4      bc 2018-03-03 23:01       9          Sa
5      cc 2018-01-02 11:01       1          Di

I'm also a smartass because I didn't use the weekday-function but wday from the lubridate-package.

Humpelstielzchen
  • 6,126
  • 3
  • 14
  • 34