
I have two data tables of the form: param1 param2 param3 ... obs1_i obs2_i obs3_i .... That is, they share some of their first columns (the parameters) and differ in the rest (the observations). Every set of params appears in both tables (the same number of times). As the title says, I want to merge the two tables. What I have done so far:

library(dplyr)
dt1 = dt1 %>% group_by(param1, param2,...)
dt2 = dt2 %>% group_by(param1, param2,...)
dt  = merge(dt1, dt2)

The problem is that dt1 originally has class data.table data.frame; after the grouping it is grouped_df tbl_df tbl data.frame, and after the merging it is just data.frame. But I would prefer to work with a data.table. So far I have found that the data.table way of merging is dt = dt1[dt2], but it gives me the error: Must subset columns with a valid subscript vector.

So is it possible to do it without ending up with a data frame? Thanks in advance.

EDIT: Minimal example

dt1 = data.table(
    prm1 = c(1, 1, 2, 2, 3, 3),
    prm2 = c(1, 1, 2, 2, 3, 3),
    obs1 = 1:6,
    obs2 = 7:12)

dt2 = data.table(
    prm1 = c(1, 1, 2, 2, 3, 3),
    prm2 = c(1, 1, 2, 2, 3, 3),
    obs3 = 13:18,
    obs4 = 19:24)
# Now what dt should be
dt = data.table(
   prm1 = c(1, 1, 2, 2, 3, 3),
   prm2 = c(1, 1, 2, 2, 3, 3),
   obs1 = 1:6,
   obs2 = 7:12,
   obs3 = 13:18,
   obs4 = 19:24)
cgss
  • There is a `merge` function for data.table which would maintain the class as `data.table`. I am not sure why you need `group_by` first to merge tables. It would be helpful if you provided a concrete example and the corresponding expected output instead of making us guess what you want. – Ronak Shah Dec 26 '20 at 13:56
  • @RonakShah How will the merge work without the groups? If I have some parameters to be let's say 2, 5 and the obs to be 3.1 in dt1 and 2.3 in dt2 I want the merged dt to have the row: [2, 5, 3.1, 2.3]. Without the group the result I think will be [2, 5, 3.1, 2, 5, 2.3]. – cgss Dec 26 '20 at 14:24
  • You should look at data.table vignettes. Alternatively, this is a decent, if outdated, StackOverflow question https://stackoverflow.com/q/1299871/11355066. – Cole Dec 26 '20 at 16:52
  • The concept of `merge`/`join` is innately grouped by the fields you suggest as the `by=` (or `on=`) fields. – r2evans Dec 26 '20 at 16:52
    @Cole, good link. I also like https://stackoverflow.com/a/6188334/3358272 as a visualization of the merge/join process. – r2evans Dec 26 '20 at 16:53
  • @r2evans In case it was not clear from the question, my join case is very special in the sense that all joins will produce the same result. Params are common between the two dt's and take the same values. Think like having two sensors that measure different quantities in the same experiment. Params are time, room temperature etc so they are identical, and obs are the quantities being measured. – cgss Dec 26 '20 at 17:58
  • Okay, and ... I don't see how it changes anything. But if my answer doesn't work for you, then I think you need to provide a reproducible question with sample data and expected output given that data. It doesn't need to be big, just varied-enough to demonstrate why you think group-based joining is needed. – r2evans Dec 26 '20 at 18:27
  • It doesn't. I will try to create a minimal dataset and update the question. – cgss Dec 26 '20 at 18:29

1 Answer


Update:

Here are two ways to go from your dt1 and dt2 to your dt:

cbind(dt1, dt2[,3:4])
#    prm1 prm2 obs1 obs2 obs3 obs4
# 1:    1    1    1    7   13   19
# 2:    1    1    2    8   14   20
# 3:    2    2    3    9   15   21
# 4:    2    2    4   10   16   22
# 5:    3    3    5   11   17   23
# 6:    3    3    6   12   18   24

That is obviously very sensitive to row order and row count, and will fail without much effort.
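To sketch that fragility (hypothetical 3-row tables, not the question's data): cbind pairs rows purely by position, so any reordering on one side silently misaligns the observations with the parameters, with no error to warn you.

```r
library(data.table)

a = data.table(prm1 = c(1, 1, 2), prm2 = c(1, 1, 2), obs1 = 1:3)
b = data.table(prm1 = c(2, 1, 1), prm2 = c(2, 1, 1), obs3 = c(15, 13, 14))

# the first result row now carries prm1 = 1, prm2 = 1 next to obs3 = 15,
# which actually belongs to the prm 2/2 group in b -- silently wrong
cbind(a, b[, .(obs3)])
```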

An alternative is to add a "row-number within a group" (using your group assumption), and include that in the join parameters.

dt1[,n := seq_len(.N), by = .(prm1, prm2)]
dt2[,n := seq_len(.N), by = .(prm1, prm2)]
merge(dt1, dt2, by = c("prm1", "prm2", "n"))
#    prm1 prm2 n obs1 obs2 obs3 obs4
# 1:    1    1 1    1    7   13   19
# 2:    1    1 2    2    8   14   20
# 3:    2    2 1    3    9   15   21
# 4:    2    2 2    4   10   16   22
# 5:    3    3 1    5   11   17   23
# 6:    3    3 2    6   12   18   24

This relies entirely on the inference that row number within a prm1/prm2 group is meaningful.

If neither of these works with the real data, then either (a) there is some post-merge filtering that needs to be done (contextual, so I don't know), or (b) we have a problem. The problem with the same merge but without n is that when a group has more than one row on either or both sides, there will be a cartesian expansion.
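To make the expansion concrete, here is a minimal sketch (toy tables with a single prm1/prm2 group, two rows per side): without n the join produces 2 × 2 = 4 rows where 2 were intended; with n it pairs rows one-to-one.

```r
library(data.table)

x = data.table(prm1 = c(1, 1), prm2 = c(1, 1), obs1 = 1:2)
y = data.table(prm1 = c(1, 1), prm2 = c(1, 1), obs3 = 13:14)

# every left row matches every right row within the group: 4 rows, not 2
merge(x, y, by = c("prm1", "prm2"), allow.cartesian = TRUE)

# the within-group row counter restores the intended 1-1 pairing: 2 rows
x[, n := seq_len(.N), by = .(prm1, prm2)]
y[, n := seq_len(.N), by = .(prm1, prm2)]
merge(x, y, by = c("prm1", "prm2", "n"))
```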


Don't group_by your data, this is doing two things:

  1. Changing the class from data.table to grouped_df, tbl_df. This is because group_by is a dplyr function that operates by adding an attribute to the frame that indicates which rows belong to each group. Only tbl_df-based functions honor this attribute, so it needs to change the class from data.table to tbl_df (with grouped_df, etc).

  2. It is doing something for no reason, so it is wasting time (though really not much). The theory behind the merge/join is that the frames will be column-wise combined based on the keys. Your "grouping" is intended (I think) to ensure that only matching param* variables are joined, when in fact the way to think of it is in the parlance of joins, typically one of: full, semi, left, right, inner, or anti. All of these are natively supported in dplyr with well-named functions; most are enabled directly in base::merge and data.table's merge method.

    Two links are really good at explaining and visualizing what is going on with the various types of merge: How to join (merge) data frames (inner, outer, left, right) and https://stackoverflow.com/a/6188334/3358272.
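As a quick illustration of those join flavors with merge itself (hypothetical tables a and b keyed on k, just a sketch):

```r
library(data.table)

a = data.table(k = c(1, 2, 3), x = c("a", "b", "c"))
b = data.table(k = c(2, 3, 4), y = c("B", "C", "D"))

merge(a, b, by = "k")                # inner: k = 2, 3
merge(a, b, by = "k", all.x = TRUE)  # left : k = 1, 2, 3 (y is NA for k = 1)
merge(a, b, by = "k", all.y = TRUE)  # right: k = 2, 3, 4 (x is NA for k = 4)
merge(a, b, by = "k", all = TRUE)    # full : k = 1, 2, 3, 4
```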

The default behavior of base::merge (and, though not documented as such, data.table::merge as well) in the absence of an explicit by= argument (or by.x/by.y) is to infer the columns based on intersect(names(x), names(y)), which is very often the desired behavior. I discourage this in programmatic use, though, as it can lead to mistakes when data is not always shaped/named perfectly. (The dplyr join verbs all provide messages when inference is made.)

If we start with your original not-grouped (and therefore still-data.table) dt1 and dt2 objects, then we should be able to do one of the following, preserving the data.table class:

# inferential "by" columns, not great
merge(dt1, dt2)

# default behavior, now explicit
merge(dt1, dt2, by = intersect(names(dt1), names(dt2)))

# slightly better: it will error if any params are in x and not y, our assumption
merge(dt1, dt2, by = grep("^param", names(dt1), value = TRUE))

# data.table-esque, same left-join (no full-join in this format)
dt1[dt2, on = intersect(names(dt1), names(dt2))]

One good reference for data.table joins: https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html.
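One note on the bracket form, sketched with toy tables: X[Y, on = ...] keeps every row of Y (the table inside the brackets), so it behaves like merge(X, Y, all.y = TRUE) up to column order, which is why there is no full join in this format.

```r
library(data.table)

X = data.table(k = 1:3, x = c("a", "b", "c"))
Y = data.table(k = 2:4, y = c("B", "C", "D"))

# keeps all 3 rows of Y; x is NA where k = 4 has no match in X
X[Y, on = "k"]
```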

r2evans
  • See the updated question. With the toy example all 4 of your versions run but don't produce the correct output. – cgss Dec 26 '20 at 18:44
  • Thank you very much. It works well. I would add a `n := NULL` just for completeness. You are right. Without n there is a cartesian expansion. The whole point of my grouping was to take advantage of the 1-1 nature of the extra number of rows to avoid it. – cgss Dec 26 '20 at 21:50
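Following that last comment, the full recipe with the helper column dropped afterwards would look something like this (using the question's toy tables):

```r
library(data.table)

dt1 = data.table(prm1 = c(1, 1, 2, 2, 3, 3), prm2 = c(1, 1, 2, 2, 3, 3),
                 obs1 = 1:6, obs2 = 7:12)
dt2 = data.table(prm1 = c(1, 1, 2, 2, 3, 3), prm2 = c(1, 1, 2, 2, 3, 3),
                 obs3 = 13:18, obs4 = 19:24)

dt1[, n := seq_len(.N), by = .(prm1, prm2)]
dt2[, n := seq_len(.N), by = .(prm1, prm2)]
dt = merge(dt1, dt2, by = c("prm1", "prm2", "n"))
dt[, n := NULL]  # drop the helper column, as suggested in the comments
dt               # still a data.table, with obs1..obs4 side by side
```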