R - Apply dist function to groups

Question

I am trying to apply the dist() function to each group of rows in R, but the result looks as if no grouping happens at all: dist() is simply applied to my whole dataframe.

df2 %>% dplyr::group_by(X1) %>% dist()

Where df2 is my dataframe and I am just applying to the head for now, for simplicity. Essentially, each group contains coordinates (A,B) and I am trying to get the distance between each point.

Here is my dataframe:

   X1  A              B
1   1  12             0.0
2   1  18             0.0
3   1  18             1.0
4   1  13             0.0
5   1  18             4.0
6   1  18             0.0
7   1  18             5.0
8   1  18             0.0
9   1  18             0.0
10  2  73            -2.0
11  2  73            -0.5
12  2  74            -0.5
13  2  73             0.0
14  2  71            -1.0
15  2  75             0.0

My desired output is the lower triangular matrix of each group, here is an example:

Can you describe the desired outcome a little more? Correct me if I'm wrong but dist returns a matrix so if you're grouping by a vector in df2, then the distance matrices will likely be different sizes. Did you want a list of distance matrices? — svenhalvorson, Jun 06 '17 at 15:54
@svenhalvorson Yes exactly, I added more details. I actually don't care about the matrix output itself, I will just throw all the values of the lower triangular matrix into a vector. — guy, Jun 06 '17 at 16:04
@tbone I am still unclear about the problem you are facing but based on what I understand here's a solution that I came up with: `mydf <- df2 %>% dplyr::group_by(X1) %>% dplyr::summarise(distmatrix=list(dist(cbind(A,B))))` and the `distmatrix` column of `mydf` will contain the list of distance matrices for each group. — A Gore, Jun 06 '17 at 16:25
@AGore this works but gives an odd output, `... ` followed by the values of the distmatrix and then `... ` again — guy, Jun 06 '17 at 16:31
@tbone If you do `mydf$distmatrix` it will give you the list of distance matrices. — A Gore, Jun 06 '17 at 16:32
@AGore Ah I see, it still gives the wrong values that I have been dealing with the whole time. It did not perform the dist operation per group but on the whole original df :( — guy, Jun 06 '17 at 16:33
You can always split and apply `lapply(split(df, df$X1), dist)` — Sotos, Jun 06 '17 at 16:41
@tbone It should give you the right answer. Look at my solution. — A Gore, Jun 06 '17 at 16:49
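For reference, the split-and-apply idea from the comments can be sketched as follows. dist() is a base R function that does not look at dplyr grouping, so the pipe in the question hands it the whole table. This is only a minimal sketch that assumes the sample data shown in the question; the names `pts`, `dists` and `all_values` are illustrative and not from the thread.

```r
# Sample data from the question
df2 <- data.frame(
  X1 = rep(c(1L, 2L), times = c(9L, 6L)),
  A  = c(12, 18, 18, 13, 18, 18, 18, 18, 18, 73, 73, 74, 73, 71, 75),
  B  = c(0, 0, 1, 0, 4, 0, 5, 0, 0, -2, -0.5, -0.5, 0, -1, 0)
)

# Split only the coordinate columns by group, then call dist() on each piece
pts   <- split(df2[c("A", "B")], df2$X1)
dists <- lapply(pts, dist)

# A dist object stores only the lower triangle, so as.vector() returns
# the pairwise distances of one group as a plain numeric vector
all_values <- unlist(lapply(dists, as.vector), use.names = FALSE)
length(all_values)  # choose(9, 2) + choose(6, 2) = 51
```

Because each group is handled separately, groups of different sizes are not a problem; the per-group results simply have different lengths.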

score 2 · Accepted Answer · answered Jun 06 '17 at 16:28

Here's an example of creating distance matrices for the iris data set by species:

results = list()

for(spec in unique(iris$Species)){
  temp = iris[iris$Species==spec, 1:4]
  results[[length(results)+1]] = dist(temp)
}
names(results) = unique(iris$Species)

You'll have to figure out what to do with it afterwards.

svenhalvorson


I would rather not use for loops if possible. – guy Jun 06 '17 at 16:34
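If loops are to be avoided, the same iris example can be written with split() and lapply(). This is a sketch of one possible loop-free variant, not the answerer's own code:

```r
# Split the four numeric columns by species and apply dist() to each piece;
# the result is a named list with one dist object per species
results <- lapply(split(iris[, 1:4], iris$Species), dist)
names(results)
```

The list names come from the levels of `iris$Species`, so no separate `names(results) = ...` step is needed.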

score 1 · Answer 2 · answered Jun 06 '17 at 16:40

We can use purrr::map:

library(purrr)

df %>% 
  split(.$X1) %>% 
  map(~{
    dist(.x)
  }) -> distList

distList
#> $`1`
#>          1        2        3        4        5        6        7        8
#> 2 6.000000                                                               
#> 3 6.082763 1.000000                                                      
#> 4 1.000000 5.000000 5.099020                                             
#> 5 7.211103 4.000000 3.000000 6.403124                                    
#> 6 6.000000 0.000000 1.000000 5.000000 4.000000                           
#> 7 7.810250 5.000000 4.000000 7.071068 1.000000 5.000000                  
#> 8 6.000000 0.000000 1.000000 5.000000 4.000000 0.000000 5.000000         
#> 9 6.000000 0.000000 1.000000 5.000000 4.000000 0.000000 5.000000 0.000000
#> 
#> $`2`
#>          10       11       12       13       14
#> 11 1.500000                                    
#> 12 1.802776 1.000000                           
#> 13 2.000000 0.500000 1.118034                  
#> 14 2.236068 2.061553 3.041381 2.236068         
#> 15 2.828427 2.061553 1.118034 2.000000 4.123106

Data:

df <- read.table(text = 'X1  A              B
1   1  12             0.0
2   1  18             0.0
3   1  18             1.0
4   1  13             0.0
5   1  18             4.0
6   1  18             0.0
7   1  18             5.0
8   1  18             0.0
9   1  18             0.0
10  2  73            -2.0
11  2  73            -0.5
12  2  74            -0.5
13  2  73             0.0
14  2  71            -1.0
15  2  75             0.0', h = T)

This seems like a great solution but it yields `Error in numeric(nrowz) : invalid 'length' argument`, even though I am running your code exactly. — guy, Jun 06 '17 at 16:45
Is `purrr` updated and loaded? It's literally the only thing that's going on there. I tested on R 3.3.3 and `purrr` 0.2.2.2 . — GGamba, Jun 06 '17 at 16:50
Then the `df` object is different, and seeing as the other answers don't work either, I suggest you add the result of `dput(df)` to your question. — GGamba, Jun 06 '17 at 17:04
I think the issue is that my X1 column has over 1000 factor levels even though I am accessing a really small subset of my data. I think I need to unfactorize it or something. — guy, Jun 06 '17 at 17:19
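If the grouping column really is a factor with many unused levels, as the last comment suspects, split() by default returns one piece per level, including empty pieces, and dist() is then called on those as well. Whether that is the actual cause of the error above is not confirmed in the thread. The sketch below uses a made-up data frame `small` (hypothetical, not the asker's data) to show how `drop = TRUE` avoids the empty pieces.

```r
# Hypothetical subset whose grouping column still carries unused factor levels
small <- data.frame(
  X1 = factor(c(1, 1, 1, 2, 2), levels = 1:1000),
  A  = c(12, 18, 18, 73, 74),
  B  = c(0, 0, 1, -2, -0.5)
)

# By default split() returns one piece per level, most of them empty
n_default <- length(split(small[c("A", "B")], small$X1))

# drop = TRUE discards the unused levels before splitting
pieces    <- split(small[c("A", "B")], small$X1, drop = TRUE)
dist_list <- lapply(pieces, dist)
names(dist_list)
```

Calling droplevels() on the subset before splitting would be another option. Passing only the coordinate columns to dist() also keeps the factor column out of the distance calculation.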

A Gore · Answer 3 · 2017-06-06T17:09:16.177

Here's my code and the solution

require(dplyr)
df2 <- structure(list(X1 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L), A = c(12L, 18L, 18L, 13L, 18L, 18L, 18L, 
18L, 18L, 73L, 73L, 74L, 73L, 71L, 75L), B = c(0, 0, 1, 0, 4, 
0, 5, 0, 0, -2, -0.5, -0.5, 0, -1, 0)), .Names = c("X1", "A", 
"B"), class = "data.frame", row.names = c("1", "2", "3", "4", 
"5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15"))
mydf <- df2 %>% group_by(X1) %>% summarise(distmatrix=list(dist(cbind(A,B))))
mydf
# # A tibble: 2 × 2
# X1 distmatrix
# <int>     <list>
#   1     1 <S3: dist>
#   2     2 <S3: dist>
mydf$distmatrix
# [[1]]
#          1        2        3        4        5        6        7        8
# 2 6.000000                                                               
# 3 6.082763 1.000000                                                      
# 4 1.000000 5.000000 5.099020                                             
# 5 7.211103 4.000000 3.000000 6.403124                                    
# 6 6.000000 0.000000 1.000000 5.000000 4.000000                           
# 7 7.810250 5.000000 4.000000 7.071068 1.000000 5.000000                  
# 8 6.000000 0.000000 1.000000 5.000000 4.000000 0.000000 5.000000         
# 9 6.000000 0.000000 1.000000 5.000000 4.000000 0.000000 5.000000 0.000000
# 
# [[2]]
#          1        2        3        4        5
# 2 1.500000                                    
# 3 1.802776 1.000000                           
# 4 2.000000 0.500000 1.118034                  
# 5 2.236068 2.061553 3.041381 2.236068         
# 6 2.828427 2.061553 1.118034 2.000000 4.123106

@tbone I have attached the solution. You should get the same result. My question would then be: is this the output you want, or does it match the (wrong) output you are already getting? — A Gore, Jun 06 '17 at 17:02
While I think this uses the right logic, you could do it without `dplyr`: `out <- by(df2[, -1L], df2$X1, dist, simplify = FALSE)`. Then `out[[1L]]` would have the lower triangular of the first group and so on. — Alexis, Jun 17 '18 at 14:10
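To check that each element returned by by() really holds the lower triangle, it can be compared with the full symmetric matrix. This is a small sketch assuming the sample data from the question; `m1` and `lower1` are illustrative names.

```r
# Sample data from the question
df2 <- data.frame(
  X1 = rep(c(1L, 2L), times = c(9L, 6L)),
  A  = c(12, 18, 18, 13, 18, 18, 18, 18, 18, 73, 73, 74, 73, 71, 75),
  B  = c(0, 0, 1, 0, 4, 0, 5, 0, 0, -2, -0.5, -0.5, 0, -1, 0)
)

# One dist object per group, without dplyr
out <- by(df2[, -1L], df2$X1, dist, simplify = FALSE)

# Expand the first group to a full symmetric matrix and pull out
# its lower triangle (column by column)
m1     <- as.matrix(out[[1L]])
lower1 <- m1[lower.tri(m1)]

# The values agree with the ones stored in the dist object itself
all.equal(lower1, as.vector(out[[1L]]))
```

So as.vector() on the dist object is enough when only the values are needed; as.matrix() is useful when the square layout matters.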
