1

I have n data frames that represent weekly time periods. I'd like to do something like an inner join (based on id and id2), but for appending rows from all n datasets, not additional columns (as they're all identical).

They all look like this

DF1:
id   id2    A      B     C     PERIOD
1     50   0.1    0.2    0.3     1
1    100   0.1    0.2    0.3     1
2     2    0.1    0.2    0.3     1

DF2:
id   id2    A      B     C     PERIOD
1     50   0.5    0.7    0.9     2
1    100   0.6    0.8    0.9     2
1    105   0.1    0.2    0.3     2
2     2    0.3    0.4    0.5     2
2     3    0.1    0.2    0.3     2

...repeats for DFn

I'd like a data frame that looks like this

id   id2    A      B     C     PERIOD
1     50   0.1    0.2    0.3     1
1     50   0.5    0.7    0.9     2
...                              n

1    100   0.1    0.2    0.3     1
1    100   0.6    0.8    0.9     2
...                              n

2     2    0.1    0.2    0.3     1
2     2    0.3    0.4    0.5     2
...                              n

So it discards all combinations of id, id2 that don't appear in ALL of my n datasets. Is there a quick way of doing this?

I was thinking of first going over all n data frames, grabbing sets of pairs id, id2, and doing an intersection of all these sets, and then reduce the data frames by .isin, and then pd.concat on the list of reduced data frames. This seems tedious though.

r2evans
  • 141,215
  • 6
  • 77
  • 149

2 Answers2

2
Reduce(function(a, b) {
  rbind(
    merge(a, b[,c("id","id2")], by = c("id", "id2")),
    merge(b, a[,c("id","id2")], by = c("id", "id2"))
  )
}, list(DF1, DF2))
#   id id2   A   B   C PERIOD
# 1  1 100 0.1 0.2 0.3      1
# 2  1  50 0.1 0.2 0.3      1
# 3  2   2 0.1 0.2 0.3      1
# 4  1 100 0.6 0.8 0.9      2
# 5  1  50 0.5 0.7 0.9      2
# 6  2   2 0.3 0.4 0.5      2

If you don't want to merge twice, then you can use:

Reduce(function(a, b) {
  tmp <- merge(a, b, by = c("id", "id2"), all = TRUE)
  tmp <- tmp[complete.cases(tmp),]
  tmpx <- tmp[,c("id", "id2", grep("\\.x$", colnames(tmp), value = TRUE))]
  colnames(tmpx) <- gsub("\\.x$", "", colnames(tmpx))
  tmpy <- tmp[,c("id", "id2", grep("\\.y$", colnames(tmp), value = TRUE))]
  colnames(tmpy) <- gsub("\\.y$", "", colnames(tmpy))
  rbind(tmpx, tmpy)
}, list(DF1, DF2))
#    id id2   A   B   C PERIOD
# 1   1  50 0.1 0.2 0.3      1
# 2   1 100 0.1 0.2 0.3      1
# 4   2   2 0.1 0.2 0.3      1
# 11  1  50 0.5 0.7 0.9      2
# 21  1 100 0.6 0.8 0.9      2
# 41  2   2 0.3 0.4 0.5      2

Implementing MYousefi's suggested logic, here's a third alternative:

bind_rows(DF1, DF2) %>%
  group_by(id, id2) %>%
  filter(n() == 2L) %>%  # 2 is the number of frames joined
  ungroup()
# # A tibble: 6 x 6
#      id   id2     A     B     C PERIOD
#   <int> <int> <dbl> <dbl> <dbl>  <int>
# 1     1    50   0.1   0.2   0.3      1
# 2     1   100   0.1   0.2   0.3      1
# 3     2     2   0.1   0.2   0.3      1
# 4     1    50   0.5   0.7   0.9      2
# 5     1   100   0.6   0.8   0.9      2
# 6     2     2   0.3   0.4   0.5      2
r2evans
  • 141,215
  • 6
  • 77
  • 149
1

We could use group_split from dplyr package R: You will get a list!

library(dplyr)

bind_rows(DF1, DF2) %>% 
  group_split(id, id2)
[[1]]
# A tibble: 2 x 6
     id   id2     A     B     C PERIOD
  <int> <int> <dbl> <dbl> <dbl>  <int>
1     1    50   0.1   0.2   0.3      1
2     1    50   0.5   0.7   0.9      2

[[2]]
# A tibble: 2 x 6
     id   id2     A     B     C PERIOD
  <int> <int> <dbl> <dbl> <dbl>  <int>
1     1   100   0.1   0.2   0.3      1
2     1   100   0.6   0.8   0.9      2

[[3]]
# A tibble: 1 x 6
     id   id2     A     B     C PERIOD
  <int> <int> <dbl> <dbl> <dbl>  <int>
1     1   105   0.1   0.2   0.3      2

[[4]]
# A tibble: 2 x 6
     id   id2     A     B     C PERIOD
  <int> <int> <dbl> <dbl> <dbl>  <int>
1     2     2   0.1   0.2   0.3      1
2     2     2   0.3   0.4   0.5      2

[[5]]
# A tibble: 1 x 6
     id   id2     A     B     C PERIOD
  <int> <int> <dbl> <dbl> <dbl>  <int>
1     2     3   0.1   0.2   0.3      2

If you want to get a dataframe: you could do:

library(dplyr)
bind_rows(DF1, DF2) %>% 
  group_split(id, id2) %>% 
  bind_rows()

or simply:

library(dplyr)

bind_rows(DF1, DF2) %>% 
  arrange(id, id2)
     id   id2     A     B     C PERIOD
  <int> <int> <dbl> <dbl> <dbl>  <int>
1     1    50   0.1   0.2   0.3      1
2     1    50   0.5   0.7   0.9      2
3     1   100   0.1   0.2   0.3      1
4     1   100   0.6   0.8   0.9      2
5     1   105   0.1   0.2   0.3      2
6     2     2   0.1   0.2   0.3      1
7     2     2   0.3   0.4   0.5      2
8     2     3   0.1   0.2   0.3      2
TarJae
  • 72,363
  • 6
  • 19
  • 66