2

I have one data table with the following columns:

name, x, y
a, 1, 2
b, 2, 3
c, 3, 1

I want to join this table with itself, keeping every row where name != name and run a distance function on the x and y values from each side. The result should be in the format:

name1, name2, distance

I wrote the distance function like this:

dist <- function(a, b) sqrt((a$x-b$x)^2 + (a$y-b$y)^2)

I tried to use the outer function, but it only takes vectors, not data tables and I tried using the various joins in dplyr but was unsuccessful.

Nick Larsen
  • 18,631
  • 6
  • 67
  • 96
  • Looks like you are just using euclidean distance. I think you just want the `dist()` function to do the calculation. It would be easier to test with a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input data. – MrFlick Jul 29 '16 at 21:08

2 Answers2

8

Suppose your dataset looked like:

d <- data_frame(name = rownames(mtcars), x = mtcars$mpg, y = mtcars$cyl)

A general way to try all combinations of two data frames (or all combinations with themselves) is tidyr's crossing function (though you'll need to be careful about renaming the columns). Afterwards you'll be able to compute the distance and do your filtering:

library(dplyr)
library(tidyr)

d %>%
  rename(name1 = name, x1 = x, y1 = y) %>%
  crossing(d) %>%
  rename(name2 = name, x2 = x, y2 = y) %>%
  mutate(distance = sqrt((x1 - x2) ^ 2 + (y1 - y2) ^ 2)) %>%
  filter(name1 != name2)

In this particular case, you could use my fuzzyjoin package, specifically distance_join (you'll need the latest development version from GitHub). This joins two data frames (in this case, a self-join) based on a distance threshold, and appends an additional column with the distance:

library(fuzzyjoin)

d %>%
  rename(name1 = name) %>%
  distance_inner_join(d, max_dist = Inf, distance_col = "distance") %>%
  rename(name2 = name) %>%
  filter(name1 != name2)

This will give:

# A tibble: 992 x 7
       name1   x.x   y.x             name2   x.y   y.y distance
       <chr> <dbl> <dbl>             <chr> <dbl> <dbl>    <dbl>
1  Mazda RX4    21     6     Mazda RX4 Wag  21.0     6 0.000000
2  Mazda RX4    21     6        Datsun 710  22.8     4 2.690725
3  Mazda RX4    21     6    Hornet 4 Drive  21.4     6 0.400000
4  Mazda RX4    21     6 Hornet Sportabout  18.7     8 3.047950
5  Mazda RX4    21     6           Valiant  18.1     6 2.900000
6  Mazda RX4    21     6        Duster 360  14.3     8 6.992138
7  Mazda RX4    21     6         Merc 240D  24.4     4 3.944617
8  Mazda RX4    21     6          Merc 230  22.8     4 2.690725
9  Mazda RX4    21     6          Merc 280  19.2     6 1.800000
10 Mazda RX4    21     6         Merc 280C  17.8     6 3.200000
# ... with 982 more rows

You could set max_dist to another, non-infinite threshold if you know you don't care about distant matches.

David Robinson
  • 77,383
  • 16
  • 167
  • 187
3

Here is a base R method that uses cbind and dist (the function mentioned by @mrflick). we have a data.frame named df created at the bottom of this post.

Note that dist returns a lower triangular matrix:

dist(df[,-1])
         1        2
2 1.414214         
3 2.236068 2.236068

We can use combn to create pairwise comparisons of the names variable, then combine the result into a data.frame and give the columns names with setNames.

dfNew <- setNames(data.frame(t(combn(df$name, 2)),
                             combn(df$name, 2, function(i) {
                                                 dist(df[df$name %in% i, -1])})),
                  c("var1", "var2", "distance"))

which returns

dfNew
      var1 var2 distance
    1    a    b 1.414214
    2    a    c 2.236068
    3    b    c 2.236068

Note that the names variable must be character or you have to wrap it in the the as.character function for this to work.

data

df <- read.table(header=TRUE, text="name, x, y
a, 1, 2
b, 2, 3
c, 3, 1", sep=",", stringsAsFactors=F)
lmo
  • 37,904
  • 9
  • 56
  • 69