0

I have a table were each sample has a unique identifier but also a section identifier. I want to extract all vs all distance comparisons for each section (this data comes from a second table)

eg table 1

Sample    Section
1         1
2         1
3         1
4         2
5         2
6         3

table 2

sample    sample    distance
1         2         10
1         3         1
1         4         2
2         3         5
2         4         10
3         4         11

so my desired output is a list which has distance for: [1 vs 2], [1 vs 3], [2 vs 3], [4 vs 5] - ie all distance comparisons from table two for samples which share a section in table 1

I started trying to do this with nested for loops, but it quickly got messy.. Any ideas of a neat way to do this?

www
  • 38,575
  • 12
  • 48
  • 84
Sam Lipworth
  • 107
  • 4
  • 12
  • Please use `dput()` to show your data! – jogo Feb 16 '18 at 14:08
  • ?? what is this – Sam Lipworth Feb 16 '18 at 14:09
  • With the output from `dput(...)` we can reproduce your data, e.g. `dput(BOD)`. The data can exactly reproduced by `B <- ..... # (output from dput(...))` So, put the output from `dput(...)` in your question, please read: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example Alternative you can give a definition of your dataframe, i.e. `table1 <- data.frame(...)` . – jogo Feb 16 '18 at 14:14
  • ah clever OK noted thanks I will do in future – Sam Lipworth Feb 16 '18 at 14:16

2 Answers2

1

A solution using .

We can first create a data frame showing the combination of samples in each section.

library(dplyr)

table1_cross <- full_join(table1, table1, by = "Section") %>%    # Full join by Section
  filter(Sample.x != Sample.y) %>%                               # Remove records with same samples
  rowwise() %>%
  mutate(Sample.all = toString(sort(c(Sample.x, Sample.y)))) %>% # Create a column showing the combination between Sample.x and Sample.y
  ungroup() %>%
  distinct(Sample.all, .keep_all = TRUE) %>%                     # Remove duplicates in Sample.all
  select(Sample1 = Sample.x, Sample2 = Sample.y, Section)
table1_cross
# # A tibble: 4 x 3
#   Sample1 Sample2 Section
#     <int>   <int>   <int>
# 1       1       2       1
# 2       1       3       1
# 3       2       3       1
# 4       4       5       2

We can then filter table2 by table1_cross. table3 is the final output.

table3 <- table2 %>%                                     
  semi_join(table1_cross, by = c("Sample1", "Sample2")) # Filter table2 based on table1_corss

table3
#   Sample1 Sample2 distance
# 1       1       2       10
# 2       1       3        1
# 3       2       3        5

DATA

table1 <- read.table(text = "Sample    Section
1         1
                     2         1
                     3         1
                     4         2
                     5         2
                     6         3",
                     header = TRUE, stringsAsFactors = FALSE)

table2 <- read.table(text = "Sample1    Sample2    distance
1         2         10
                     1         3         1
                     1         4         2
                     2         3         5
                     2         4         10
                     3         4         11",
                     header = TRUE, stringsAsFactors = FALSE)
www
  • 38,575
  • 12
  • 48
  • 84
0

The OP has requested to find all distance comparisons from table2 for samples which share a section in table1.

This can be achieved by two different approaches:

  1. Look up the respective section ids for Sample1 and Sample2 each in table1 and keep only those rows of table2 where the section ids match.
  2. Create all unique combinations of sample ids for each section in table1 and find the appropriate entries in table2 (if any).

Approach 1

Base R

tmp <- merge(table2, table1, by.x = "Sample1", by.y = "Sample")
tmp <- merge(tmp, table1, by.x = "Sample2", by.y = "Sample")
tmp[tmp$Section.x == tmp$Section.y, c("Sample2", "Sample1", "distance")]
  Sample2 Sample1 distance
1       2       1       10
2       3       1        1
3       3       2        5

dplyr

library(dplyr)
table2 %>% 
  inner_join(table1, by = c(Sample1 = "Sample")) %>% 
  inner_join(table1, by = c(Sample2 = "Sample")) %>% 
  filter(Section.x == Section.y) %>% 
  select(-Section.x, -Section.y)
  Sample1 Sample2 distance
1       1       2       10
2       1       3        1
3       2       3        5

data.table

Using nested joins

library(data.table)
tmp <- setDT(table1)[setDT(table2), on = .(Sample == Sample1)]
table1[tmp, on = .(Sample == Sample2)][
  Section == i.Section, .(Sample1 = i.Sample, Sample2 = Sample, distance)]

using merge() and chained data.table expressions

tmp <- merge(setDT(table2), setDT(table1), by.x = "Sample1", by.y = "Sample")
merge(tmp, table1, by.x = "Sample2", by.y = "Sample")[
  Section.x == Section.y, -c("Section.x", "Section.y")]
   Sample2 Sample1 distance
1:       2       1       10
2:       3       1        1
3:       3       2        5

Approach 2

Base R

table1_cross <- do.call(rbind, lst <- lapply(
  split(table1, table1$Section), 
  function(x) as.data.frame(combinat::combn2(x$Sample))))
merge(table2, table1_cross, by.x = c("Sample1", "Sample2"), by.y = c("V1", "V2"))

Here, the handy combn2(x) function is used which generates all combinations of the elements of x taken two at a time, e.g.,

combinat::combn2(1:3)
     [,1] [,2]
[1,]    1    2
[2,]    1    3
[3,]    2    3

The tedious part is to apply combn2() to each group of Section separately and to create a data.frame which can be merged, finally.

dplyr

This is a streamlined version of www's approach

full_join(table1, table1, by = "Section") %>%
  filter(Sample.x < Sample.y) %>% 
  semi_join(x = table2, y = ., by = c(Sample1 = "Sample.x", Sample2 = "Sample.y"))

Non-equi self join

library(data.table)
setDT(table2)[setDT(table1)[table1, on = .(Section, Sample < Sample), allow = TRUE,
              .(Section, Sample1 = x.Sample, Sample2 = i.Sample)],
              on = .(Sample1, Sample2), nomatch = 0L]
   Sample1 Sample2 distance Section
1:       1       2       10       1
2:       1       3        1       1
3:       2       3        5       1

Here, a non-equi join is used to create the unique combinations of Sample for each Section. This is equivalent to using combn2():

setDT(table1)[table1, on = .(Section, Sample < Sample), allow = TRUE,
              .(Section, Sample1 = x.Sample, Sample2 = i.Sample)]
   Section Sample1 Sample2
1:       1      NA       1
2:       1       1       2
3:       1       1       3
4:       1       2       3
5:       2      NA       4
6:       2       4       5
7:       3      NA       6

The NA rows will be removed in the final join.

Uwe
  • 41,420
  • 11
  • 90
  • 134