Python or R - Concat/Appending rows like an inner join

Question

I have n data frames that represent weekly time periods. I'd like to do something like an inner join (based on id and id2), but for appending rows from all n datasets, not additional columns (as they're all identical).

They all look like this

DF1:
id   id2    A      B     C     PERIOD
1     50   0.1    0.2    0.3     1
1    100   0.1    0.2    0.3     1
2     2    0.1    0.2    0.3     1

DF2:
id   id2    A      B     C     PERIOD
1     50   0.5    0.7    0.9     2
1    100   0.6    0.8    0.9     2
1    105   0.1    0.2    0.3     2
2     2    0.3    0.4    0.5     2
2     3    0.1    0.2    0.3     2

...repeats for DFn

I'd like a data frame that looks like this

id   id2    A      B     C     PERIOD
1     50   0.1    0.2    0.3     1
1     50   0.5    0.7    0.9     2
...                              n

1    100   0.1    0.2    0.3     1
1    100   0.6    0.8    0.9     2
...                              n

2     2    0.1    0.2    0.3     1
2     2    0.3    0.4    0.5     2
...                              n

So it discards all combinations of id, id2 that don't appear in ALL of my n datasets. Is there a quick way of doing this?

I was thinking of first going over all n data frames, grabbing sets of pairs id, id2, and doing an intersection of all these sets, and then reduce the data frames by .isin, and then pd.concat on the list of reduced data frames. This seems tedious though.

As I can do it in tidyverse or pandas, doesn't really matter — Mihovil Mandic, Jan 07 '22 at 20:58
I'd suggest concat all and then https://stackoverflow.com/questions/49735683/python-removing-rows-on-count-condition. Use id, id2 for grouping and set count condition to n — MYousefi, Jan 07 '22 at 21:05
Smart! I forgot I can clean up duplicates to ensure the different n rows come from n dataframes. Thanks — Mihovil Mandic, Jan 07 '22 at 21:09

r2evans · Answer 1 · 2022-01-07T21:08:59.673

Reduce(function(a, b) {
  rbind(
    merge(a, b[,c("id","id2")], by = c("id", "id2")),
    merge(b, a[,c("id","id2")], by = c("id", "id2"))
  )
}, list(DF1, DF2))
#   id id2   A   B   C PERIOD
# 1  1 100 0.1 0.2 0.3      1
# 2  1  50 0.1 0.2 0.3      1
# 3  2   2 0.1 0.2 0.3      1
# 4  1 100 0.6 0.8 0.9      2
# 5  1  50 0.5 0.7 0.9      2
# 6  2   2 0.3 0.4 0.5      2

If you don't want to merge twice, then you can use:

Reduce(function(a, b) {
  tmp <- merge(a, b, by = c("id", "id2"), all = TRUE)
  tmp <- tmp[complete.cases(tmp),]
  tmpx <- tmp[,c("id", "id2", grep("\\.x$", colnames(tmp), value = TRUE))]
  colnames(tmpx) <- gsub("\\.x$", "", colnames(tmpx))
  tmpy <- tmp[,c("id", "id2", grep("\\.y$", colnames(tmp), value = TRUE))]
  colnames(tmpy) <- gsub("\\.y$", "", colnames(tmpy))
  rbind(tmpx, tmpy)
}, list(DF1, DF2))
#    id id2   A   B   C PERIOD
# 1   1  50 0.1 0.2 0.3      1
# 2   1 100 0.1 0.2 0.3      1
# 4   2   2 0.1 0.2 0.3      1
# 11  1  50 0.5 0.7 0.9      2
# 21  1 100 0.6 0.8 0.9      2
# 41  2   2 0.3 0.4 0.5      2

Implementing MYousefi's suggested logic, here's a third alternative:

bind_rows(DF1, DF2) %>%
  group_by(id, id2) %>%
  filter(n() == 2L) %>%  # 2 is the number of frames joined
  ungroup()
# # A tibble: 6 x 6
#      id   id2     A     B     C PERIOD
#   <int> <int> <dbl> <dbl> <dbl>  <int>
# 1     1    50   0.1   0.2   0.3      1
# 2     1   100   0.1   0.2   0.3      1
# 3     2     2   0.1   0.2   0.3      1
# 4     1    50   0.5   0.7   0.9      2
# 5     1   100   0.6   0.8   0.9      2
# 6     2     2   0.3   0.4   0.5      2

score 1 · Answer 2 · answered Jan 07 '22 at 21:05

We could use group_split from dplyr package R: You will get a list!

library(dplyr)

bind_rows(DF1, DF2) %>% 
  group_split(id, id2)

[[1]]
# A tibble: 2 x 6
     id   id2     A     B     C PERIOD
  <int> <int> <dbl> <dbl> <dbl>  <int>
1     1    50   0.1   0.2   0.3      1
2     1    50   0.5   0.7   0.9      2

[[2]]
# A tibble: 2 x 6
     id   id2     A     B     C PERIOD
  <int> <int> <dbl> <dbl> <dbl>  <int>
1     1   100   0.1   0.2   0.3      1
2     1   100   0.6   0.8   0.9      2

[[3]]
# A tibble: 1 x 6
     id   id2     A     B     C PERIOD
  <int> <int> <dbl> <dbl> <dbl>  <int>
1     1   105   0.1   0.2   0.3      2

[[4]]
# A tibble: 2 x 6
     id   id2     A     B     C PERIOD
  <int> <int> <dbl> <dbl> <dbl>  <int>
1     2     2   0.1   0.2   0.3      1
2     2     2   0.3   0.4   0.5      2

[[5]]
# A tibble: 1 x 6
     id   id2     A     B     C PERIOD
  <int> <int> <dbl> <dbl> <dbl>  <int>
1     2     3   0.1   0.2   0.3      2

If you want to get a dataframe: you could do:

library(dplyr)
bind_rows(DF1, DF2) %>% 
  group_split(id, id2) %>% 
  bind_rows()

or simply:

library(dplyr)

bind_rows(DF1, DF2) %>% 
  arrange(id, id2)

     id   id2     A     B     C PERIOD
  <int> <int> <dbl> <dbl> <dbl>  <int>
1     1    50   0.1   0.2   0.3      1
2     1    50   0.5   0.7   0.9      2
3     1   100   0.1   0.2   0.3      1
4     1   100   0.6   0.8   0.9      2
5     1   105   0.1   0.2   0.3      2
6     2     2   0.1   0.2   0.3      1
7     2     2   0.3   0.4   0.5      2
8     2     3   0.1   0.2   0.3      2

Python or R - Concat/Appending rows like an inner join

2 Answers2