Update:
Here are two ways to go from your dt1
and dt2
to your dt
:
cbind(dt1, dt2[,3:4])
# prm1 prm2 obs1 obs2 obs3 obs4
# 1: 1 1 1 7 13 19
# 2: 1 1 2 8 14 20
# 3: 2 2 3 9 15 21
# 4: 2 2 4 10 16 22
# 5: 3 3 5 11 17 23
# 6: 3 3 6 12 18 24
That is obviously very sensitive to the number of rows, and will fail without much effort.
An alternative is to add a "row-number within a group" (using your group assumption), and include that in the join parameters.
dt1[,n := seq_len(.N), by = .(prm1, prm2)]
dt2[,n := seq_len(.N), by = .(prm1, prm2)]
merge(dt1, dt2, by = c("prm1", "prm2", "n"))
# prm1 prm2 n obs1 obs2 obs3 obs4
# 1: 1 1 1 1 7 13 19
# 2: 1 1 2 2 8 14 20
# 3: 2 2 1 3 9 15 21
# 4: 2 2 2 4 10 16 22
# 5: 3 3 1 5 11 17 23
# 6: 3 3 2 6 12 18 24
This is a complete inference, that row-number within a prm1/prm2
group is meaninful.
If neither of these work with the real data, then either (a) there is a bit of post-merge
filtering that needs to be done (contextual, so I don't know), or (b) we have a problem. The problem with the same merge but without n
is that each group has more than 1 row, on either or both sides, meaning there will be a cartesian expansion.
Don't group_by
your data, this is doing two things:
Changing the class from data.table
to grouped_df, tbl_df
. This is because group_by
is a dplyr
function that operates by adding an attribute to the frame that indicates which rows belong to each group. Only tbl_df
-based functions honor this attribute, so it needs to change the class from data.table
to tbl_df
(with grouped_df
, etc).
It is doing something for no reason, so it is wasting time (though really not much). The theory behind the merge/join is that the frames will be column-wise combined based on the keys. Your "grouping" is intended (I think) to ensure that only matching param*
variables are joined, when in fact the way to think of it is in the parlance of joins, typically from one of: full, semi, left, right, inner, or anti. All of there are natively supported in dplyr
with well-named functions; most are enabled directly in base::merge
and data.table::merge
.
Two links are really good at explaining and visualizing what is going on with the various types of merge: How to join (merge) data frames (inner, outer, left, right) and https://stackoverflow.com/a/6188334/3358272.
The default behavior of base::merge
(and, though not documented as such, data.table::merge
as well) in the absence of an explicit by=
argument (or by.x
/by.y
) is to infer the columns based on intersect(names(x), names(y))
, which is very often the desired behavior. I discourage this in programmatic use, though, as it can lead to mistakes when data is not always shaped/named perfectly. (The dplyr
join verbs all provide messages when inference is made.)
If we start with your original not-grouped (and therefore still-data.table
) dt1
and dt2
objects, then we should be able to do one of the following, preserving the data.table
class:
# inferential "by" columns, not great
merge(dt1, dt2)
# default behavior, now explicit
merge(dt1, dt2, by = intersect(names(dt1), names(dt2)))
# slightly better: it will error if any params are in x and not y, our assumption
merge(dt1, dt2, by = grep("^param", names(dt1), value = TRUE))
# data.table-esque, same left-join (no full-join in this format)
dt1[dt2, on = intersect(names(dt1), names(dt2))]
One good reference for data.table
joins: https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html.