filter items from one column based on shared items in another R

Question

I have a table were each sample has a unique identifier but also a section identifier. I want to extract all vs all distance comparisons for each section (this data comes from a second table)

eg table 1

Sample    Section
1         1
2         1
3         1
4         2
5         2
6         3

table 2

sample    sample    distance
1         2         10
1         3         1
1         4         2
2         3         5
2         4         10
3         4         11

so my desired output is a list which has distance for: [1 vs 2], [1 vs 3], [2 vs 3], [4 vs 5] - ie all distance comparisons from table two for samples which share a section in table 1

I started trying to do this with nested for loops, but it quickly got messy.. Any ideas of a neat way to do this?

With the output from `dput(...)` we can reproduce your data, e.g. `dput(BOD)`. The data can exactly reproduced by `B <- ..... # (output from dput(...))` So, put the output from `dput(...)` in your question, please read: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example Alternative you can give a definition of your dataframe, i.e. `table1 <- data.frame(...)` . — jogo, Feb 16 '18 at 14:14

score 1 · Accepted Answer · answered Dec 29 '17 at 12:31

A solution using dplyr.

We can first create a data frame showing the combination of samples in each section.

library(dplyr)

table1_cross <- full_join(table1, table1, by = "Section") %>%    # Full join by Section
  filter(Sample.x != Sample.y) %>%                               # Remove records with same samples
  rowwise() %>%
  mutate(Sample.all = toString(sort(c(Sample.x, Sample.y)))) %>% # Create a column showing the combination between Sample.x and Sample.y
  ungroup() %>%
  distinct(Sample.all, .keep_all = TRUE) %>%                     # Remove duplicates in Sample.all
  select(Sample1 = Sample.x, Sample2 = Sample.y, Section)
table1_cross
# # A tibble: 4 x 3
#   Sample1 Sample2 Section
#     <int>   <int>   <int>
# 1       1       2       1
# 2       1       3       1
# 3       2       3       1
# 4       4       5       2

We can then filter table2 by table1_cross. table3 is the final output.

table3 <- table2 %>%                                     
  semi_join(table1_cross, by = c("Sample1", "Sample2")) # Filter table2 based on table1_corss

table3
#   Sample1 Sample2 distance
# 1       1       2       10
# 2       1       3        1
# 3       2       3        5

DATA

table1 <- read.table(text = "Sample    Section
1         1
                     2         1
                     3         1
                     4         2
                     5         2
                     6         3",
                     header = TRUE, stringsAsFactors = FALSE)

table2 <- read.table(text = "Sample1    Sample2    distance
1         2         10
                     1         3         1
                     1         4         2
                     2         3         5
                     2         4         10
                     3         4         11",
                     header = TRUE, stringsAsFactors = FALSE)

I can't thank you enough this is a thing of great beauty - moral of the story I need to get better at dplyr! — Sam Lipworth, Dec 29 '17 at 13:17

Uwe · Answer 2 · 2017-12-30T22:48:00.537

The OP has requested to find all distance comparisons from table2 for samples which share a section in table1.

This can be achieved by two different approaches:

Look up the respective section ids for Sample1 and Sample2 each in table1 and keep only those rows of table2 where the section ids match.
Create all unique combinations of sample ids for each section in table1 and find the appropriate entries in table2 (if any).

Approach 1

Base R

tmp <- merge(table2, table1, by.x = "Sample1", by.y = "Sample")
tmp <- merge(tmp, table1, by.x = "Sample2", by.y = "Sample")
tmp[tmp$Section.x == tmp$Section.y, c("Sample2", "Sample1", "distance")]

  Sample2 Sample1 distance
1       2       1       10
2       3       1        1
3       3       2        5

`dplyr`

library(dplyr)
table2 %>% 
  inner_join(table1, by = c(Sample1 = "Sample")) %>% 
  inner_join(table1, by = c(Sample2 = "Sample")) %>% 
  filter(Section.x == Section.y) %>% 
  select(-Section.x, -Section.y)

  Sample1 Sample2 distance
1       1       2       10
2       1       3        1
3       2       3        5

`data.table`

Using nested joins

library(data.table)
tmp <- setDT(table1)[setDT(table2), on = .(Sample == Sample1)]
table1[tmp, on = .(Sample == Sample2)][
  Section == i.Section, .(Sample1 = i.Sample, Sample2 = Sample, distance)]

using merge() and chained data.table expressions

tmp <- merge(setDT(table2), setDT(table1), by.x = "Sample1", by.y = "Sample")
merge(tmp, table1, by.x = "Sample2", by.y = "Sample")[
  Section.x == Section.y, -c("Section.x", "Section.y")]

   Sample2 Sample1 distance
1:       2       1       10
2:       3       1        1
3:       3       2        5

Approach 2

Base R

table1_cross <- do.call(rbind, lst <- lapply(
  split(table1, table1$Section), 
  function(x) as.data.frame(combinat::combn2(x$Sample))))
merge(table2, table1_cross, by.x = c("Sample1", "Sample2"), by.y = c("V1", "V2"))

Here, the handy combn2(x) function is used which generates all combinations of the elements of x taken two at a time, e.g.,

combinat::combn2(1:3)

     [,1] [,2]
[1,]    1    2
[2,]    1    3
[3,]    2    3

The tedious part is to apply combn2() to each group of Section separately and to create a data.frame which can be merged, finally.

`dplyr`

This is a streamlined version of www's approach

full_join(table1, table1, by = "Section") %>%
  filter(Sample.x < Sample.y) %>% 
  semi_join(x = table2, y = ., by = c(Sample1 = "Sample.x", Sample2 = "Sample.y"))

Non-equi self join

library(data.table)
setDT(table2)[setDT(table1)[table1, on = .(Section, Sample < Sample), allow = TRUE,
              .(Section, Sample1 = x.Sample, Sample2 = i.Sample)],
              on = .(Sample1, Sample2), nomatch = 0L]

   Sample1 Sample2 distance Section
1:       1       2       10       1
2:       1       3        1       1
3:       2       3        5       1

Here, a non-equi join is used to create the unique combinations of Sample for each Section. This is equivalent to using combn2():

setDT(table1)[table1, on = .(Section, Sample < Sample), allow = TRUE,
              .(Section, Sample1 = x.Sample, Sample2 = i.Sample)]

   Section Sample1 Sample2
1:       1      NA       1
2:       1       1       2
3:       1       1       3
4:       1       2       3
5:       2      NA       4
6:       2       4       5
7:       3      NA       6

The NA rows will be removed in the final join.

filter items from one column based on shared items in another R

2 Answers2

Approach 1

Base R

dplyr

data.table

Approach 2

Base R

dplyr

Non-equi self join

`dplyr`

`data.table`

`dplyr`