Join the two dataframe by conditions

Question

I have a datasets collected by alternative auto-time setting. This is the example format.

file = data.table(ID = c(1:19), time = as.hms("6:01:00 ", "6:03:00 ", 
                                                  "6:05:00 ", "6:08:00 ",
                                                  "6:10:00 ", "6:12:00 ",
                                                  "6:15:00 ", "6:17:00 ",
                                                  "6:19:00 ", "6:22:00 ",
                                                  "6:24:00 ", "6:26:00 ",
                                                  "6:29:00 "), 
                  Turn = c("A", "A", "B", "B", "C", "C", "A", "A", "A", "B", 
                           "B", "C", "C"),
                  Value = c("16.00", "18.00", "17.00", "19.00", "21.00",
                            "26.00", "25.60", "27.34", "29.09", "30.83",
                            "32.57", "34.31", "36.06"))

From this dataset, I separated it into three dataframes for each turn by using the following code:

Data_A <- file[file$Turn %in% c("A"), ]
Data_B <- file[file$Turn %in% c("B"), ]
Data_C <- file[file$Turn %in% c("C"), ]


Data_A
ID  Time    Turn    Value
1   6:01:00 A   16.00
2   6:03:00 A   18.00
7   6:15:00 A   25.60
8   6:17:00 A   27.34
9   6:19:00 A   29.09

Data_C
ID  Time    Turn    Value
5   6:10:00 C   21.00
6   6:12:00 C   26.00
12  6:26:00 C   34.31
13  6:29:00 C   36.06

I want to combine Data_A and Data_C by the nearest time as follow:

ID  Time    Turn    Value Time C Turn C Value C
1   6:01:00 A   16.00  6:10:00  C   21.00
2   6:03:00 A   18.00  6:12:00  C   26.00
7   6:15:00 A   25.60  6:26:00  C   34.31
8   6:17:00 A   27.34  6:29:00  C   36.06
9   6:19:00 A   29.09

As you see, there is an extra A in Data_A. I will exclude it. There are so many such irregular data in the time series dataset. Can anyone help me how to code for this result? Thanks!

So there should be only one matched Time of Data_C in the join? — Maël, Jan 04 '23 at 08:46
Why does `ID` have 19 values and all other columns have only 13? — r2evans, Jan 04 '23 at 12:49

r2evans · Answer 1 · 2023-01-05T10:26:29.753

In order to combine two frames like this, you need to ensure they both have the same number of rows. This means that (for instance) Data_C will need to have a row of NA values appended. This is fairly straight-forward:

maxrows <- max(nrow(Data_A), nrow(Data_B), nrow(Data_C))
Data_C <- do.call(rbind, c(list(Data_C),
                           replicate(maxrows - nrow(Data_C),
                                     Data_C[1,][NA,],
                                     simplify = FALSE)))
names(Data_C) <- paste0(names(Data_C), "_C")
cbind(Data_A, Data_C)
#       ID     time   Turn  Value  ID_C   time_C Turn_C Value_C
#    <int>    <hms> <char> <char> <int>    <hms> <char>  <char>
# 1:     1 06:01:00      A  16.00     5 06:01:00      C   21.00
# 2:     2 06:01:00      A  18.00     6 06:01:00      C   26.00
# 3:     7 06:01:00      A  25.60    12 06:01:00      C   34.31
# 4:     8 06:01:00      A  27.34    13 06:01:00      C   36.06
# 5:     9 06:01:00      A  29.09    NA       NA   <NA>    <NA>

The use of Data_C[1,][NA,] is meant to get a single row where each value is the appropriate form of NA (there are at least six distinct classes of NA).

Side thoughts:

It's generally better to keep the data all in one frame instead of multiple frames, that way grouping operations are defined in one place and can be done easily using dplyr::group_by, base::ave (and other grouping functions), and data.table's by= argument.
Even if you do split a frame into multiple frames, it is often better to keep them as a list of frames, since what you often do to one is easily done to all using lapply. One could generate this easily with spl <- split(file, file$Turn). From there, reference sdl$C instead of your Data_C.

This might be done more efficiently on file as a pivot.

library(dplyr)
library(tidyr)
file %>%
  group_by(Turn) %>%
  mutate(rn = row_number()) %>%
  ungroup() %>%
  pivot_wider("rn", names_from = "Turn", values_from = c("ID", "time", "Value")) %>%
  select(-rn)
# # A tibble: 5 × 9
#    ID_A  ID_B  ID_C time_A time_B time_C Value_A Value_B Value_C
#   <int> <int> <int> <time> <time> <time> <chr>   <chr>   <chr>  
# 1     1     3     5 06:01  06:01  06:01  16.00   17.00   21.00  
# 2     2     4     6 06:01  06:01  06:01  18.00   19.00   26.00  
# 3     7    10    12 06:01  06:01  06:01  25.60   30.83   34.31  
# 4     8    11    13 06:01  06:01  06:01  27.34   32.57   36.06  
# 5     9    NA    NA 06:01     NA     NA  29.09   NA      NA

or with data.table:

file[, rn := seq_len(.N), by = Turn]
dcast(file, rn ~ Turn, value.var = c("ID", "time", "Value"))[, rn := NULL][]
#     ID_A  ID_B  ID_C   time_A   time_B   time_C Value_A Value_B Value_C
#    <int> <int> <int>    <hms>    <hms>    <hms>  <char>  <char>  <char>
# 1:     1     3     5 06:01:00 06:01:00 06:01:00   16.00   17.00   21.00
# 2:     2     4     6 06:01:00 06:01:00 06:01:00   18.00   19.00   26.00
# 3:     7    10    12 06:01:00 06:01:00 06:01:00   25.60   30.83   34.31
# 4:     8    11    13 06:01:00 06:01:00 06:01:00   27.34   32.57   36.06
# 5:     9    NA    NA 06:01:00       NA       NA   29.09    <NA>    <NA>

Ordering of columns should be relatively straight-forward.

Thanks, r2evans! Your code seems to work for the values, but I wonder why the time changed like this. I want to keep the time as original. Any idea??? Thanks. — Heiwa, Jan 05 '23 at 08:59
My guess is that the real value is really the same. Tidyverse packages tend to prefer fancy rendering of values (using `pillar`) in custom ways, I suspect it's just a size reduction of what is printed but no change in the numeric value. — r2evans, Jan 05 '23 at 10:29
Thanks again, r2evans! I tested the code based on your suggestions. library(dplyr) library(tidyr) file %>% group_by(Turn) %>% mutate(rn = row_number()) %>% ungroup() %>% pivot_wider("rn", names_from = "Turn", values_from = c("ID", "time", "Value")) %>% select(-rn) It works to some extent, but the value for ID_A 9 matched the values for the following ID_B and ID_C in the remaining time series. Again, those values are linked with the ID_A 14, and so on. — Heiwa, Jan 11 '23 at 08:28

score 1 · Answer 2 · answered Jan 04 '23 at 15:46

You could use data.table's roll='nearest' join option.
In order to do so, time needs to be converted to numeric:

library(data.table)
library(hms)

file = data.table(ID = c(1:19), time = lapply(c("6:01:00", "6:03:00","6:05:00", "6:08:00 ","6:10:00 ", "6:12:00 ", "6:15:00 ", "6:17:00 ", "6:19:00 ", "6:22:00 ",       "6:24:00 ", "6:26:00 ", "6:29:00 "),as.hms), 
                  Turn = c("A", "A", "B", "B", "C", "C", "A", "A", "A", "B", 
                           "B", "C", "C"),
                  Value = c("16.00", "18.00", "17.00", "19.00", "21.00",
                            "26.00", "25.60", "27.34", "29.09", "30.83",
                            "32.57", "34.31", "36.06"))


# numeric join field
file[,itime:=as.numeric(time)]

Data_A <- file[Turn %in% c("A"), ]
Data_B <- file[Turn %in% c("B"), ]
Data_C <- file[Turn %in% c("C"), ]

Data_C[Data_A,.(ID = i.ID,
                time = i.time,
                Value = i.Value,
                Turn = i.Turn,
                time_C = x.time,
                Value_C = x.Value,
                Turn_C = x.Turn),on=.(itime),roll='nearest']
#>       ID     time  Value   Turn   time_C Value_C Turn_C
#>    <int>   <list> <char> <char>   <list>  <char> <char>
#> 1:     1 06:01:00  16.00      A 06:10:00   21.00      C
#> 2:     2 06:03:00  18.00      A 06:10:00   21.00      C
#> 3:     7 06:15:00  25.60      A 06:12:00   26.00      C
#> 4:     8 06:17:00  27.34      A 06:12:00   26.00      C
#> 5:     9 06:19:00  29.09      A 06:12:00   26.00      C
#> 6:    14 06:01:00  16.00      A 06:10:00   21.00      C
#> 7:    15 06:03:00  18.00      A 06:10:00   21.00      C

Thanks, Waldi! I tried your code, but it gave the repeated values of Data_C. Actually, I don't want to repeat the same values of Data_C. Anyway, thanks for your kindness. — Heiwa, Jan 05 '23 at 08:58

RYann · Answer 3 · 2023-01-04T09:14:00.487

I would try something in that logic:

First, arrange both dataset by descending order of time, using the arrange() function from the tidyverse packages.

Now that the two datas are arranged the same way,

Join them together according to dataset C using the left_join() function (also from tidyverse).

library(tidyverse)
A = arrange(Data_A, Time)
C = arrange(Data_C, Time)

left_join(A,C)

I tend to confuse left, right or inner join. Please make sure you get your desired result and change to either inner or right (there's also full/anti join, you'll figure out what each does if you play with it. It's basically a venn diagram of datasets).

To use the tidyverse packages use install.packages("tidyverse") if you don't have them installed.

Thanks, RYann! I tested that code before, but it gave me the undesired result. One value of Data_A joined with several values of Data_C. But I want to keep only one-to-one from Data_A and Data_C. Thanks indeed. — Heiwa, Jan 05 '23 at 09:01

Join the two dataframe by conditions

3 Answers3