
I am trying to apply the dist() function to each group of rows in R, but the result looks as if no grouping happened at all: dist() is simply applied to my whole data frame.

df2 %>% dplyr::group_by(X1) %>% dist()

Here df2 is my data frame; for simplicity I am only applying this to its head for now. Each group contains (A, B) coordinates, and I am trying to get the distance between every pair of points within a group.

Here is my dataframe:

   X1  A              B
1   1  12             0.0
2   1  18             0.0
3   1  18             1.0
4   1  13             0.0
5   1  18             4.0
6   1  18             0.0
7   1  18             5.0
8   1  18             0.0
9   1  18             0.0
10  2  73            -2.0
11  2  73            -0.5
12  2  74            -0.5
13  2  73             0.0
14  2  71            -1.0
15  2  75             0.0

My desired output is the lower triangular distance matrix of each group.

guy
  • Can you provide a reproducible example and desired output? – A Gore Jun 06 '17 at 15:54
  • Can you describe the desired outcome a little more? Correct me if I'm wrong but dist returns a matrix so if you're grouping by a vector in df2, then the distance matrices will likely be different sizes. Did you want a list of distance matrices? – svenhalvorson Jun 06 '17 at 15:54
  • @svenhalvorson Yes exactly, I added more details. I actually don't care about the matrix output itself, I will just throw all the values of the lower triangular matrix into a vector. – guy Jun 06 '17 at 16:04
  • @AGore I added more details to the question. – guy Jun 06 '17 at 16:05
  • @tbone I am still unclear about the problem you are facing but based on what I understand here's a solution that I came up with: `mydf <- df2 %>% dplyr::group_by(X1) %>% dplyr::summarise(distmatrix=list(dist(cbind(A,B))))` and the `distmatrix` column of `mydf` will contain the list of distance matrices for each group. – A Gore Jun 06 '17 at 16:25
  • @AGore this works but gives an odd output, `... ` followed by the values of the distmatrix and then `... ` again – guy Jun 06 '17 at 16:31
  • @tbone If you do `mydf$distmatrix` it will give you the list of distance matrix. – A Gore Jun 06 '17 at 16:32
  • @AGore Ah I see, it still gives the wrong values that I have been dealing with the whole time. It did not perform the dist operation per group but on the whole original df :( – guy Jun 06 '17 at 16:33
  • You can always split and apply `lapply(split(df, df$X1), dist)` – Sotos Jun 06 '17 at 16:41
  • @tbone It should give you the right answer. Look at my solution. – A Gore Jun 06 '17 at 16:49

3 Answers


Here's an example of creating distance matrices for the iris data set, one per species:

results = list()

for(spec in unique(iris$Species)){
  temp = iris[iris$Species==spec, 1:4]
  results[[length(results)+1]] = dist(temp)
}
names(results) = unique(iris$Species)

You'll have to figure out what to do with it afterwards.
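One possible follow-up, given that the asker only wants the lower-triangle values in a vector: a `dist` object already stores just the lower triangle, so `as.vector()` should return those values directly. This is a minimal sketch, not part of the original answer; the names `lower_by_group` and `all_dists` are made up for illustration, and `results` is rebuilt here with `lapply(split(...))` so the block runs on its own.

```r
# Rebuild the per-species distance matrices (same idea as the loop above)
results <- lapply(split(iris[, 1:4], iris$Species), dist)

# A dist object holds only the lower triangle; as.vector() drops the
# attributes and leaves a plain numeric vector of those distances
lower_by_group <- lapply(results, as.vector)

# Concatenate all within-group distances into one long vector
all_dists <- unlist(lower_by_group, use.names = FALSE)

lengths(lower_by_group)  # 50 * 49 / 2 = 1225 distances per species
length(all_dists)        # 3 * 1225 = 3675
```

Keeping the list form (`lower_by_group`) may be preferable if the group labels are still needed later.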

svenhalvorson

We can use purrr::map:

library(purrr)

df %>% 
  split(.$X1) %>% 
  map(~{
    dist(.x)
  }) -> distList

distList
#> $`1`
#>          1        2        3        4        5        6        7        8
#> 2 6.000000                                                               
#> 3 6.082763 1.000000                                                      
#> 4 1.000000 5.000000 5.099020                                             
#> 5 7.211103 4.000000 3.000000 6.403124                                    
#> 6 6.000000 0.000000 1.000000 5.000000 4.000000                           
#> 7 7.810250 5.000000 4.000000 7.071068 1.000000 5.000000                  
#> 8 6.000000 0.000000 1.000000 5.000000 4.000000 0.000000 5.000000         
#> 9 6.000000 0.000000 1.000000 5.000000 4.000000 0.000000 5.000000 0.000000
#> 
#> $`2`
#>          10       11       12       13       14
#> 11 1.500000                                    
#> 12 1.802776 1.000000                           
#> 13 2.000000 0.500000 1.118034                  
#> 14 2.236068 2.061553 3.041381 2.236068         
#> 15 2.828427 2.061553 1.118034 2.000000 4.123106

Data:

df <- read.table(text = 'X1  A              B
1   1  12             0.0
2   1  18             0.0
3   1  18             1.0
4   1  13             0.0
5   1  18             4.0
6   1  18             0.0
7   1  18             5.0
8   1  18             0.0
9   1  18             0.0
10  2  73            -2.0
11  2  73            -0.5
12  2  74            -0.5
13  2  73             0.0
14  2  71            -1.0
15  2  75             0.0', h = T)
GGamba
  • This seems like a great solution, but it yields `Error in numeric(nrowz) : invalid 'length' argument`. I am running your code exactly. – guy Jun 06 '17 at 16:45
  • Is `purrr` updated and loaded? It's literally the only thing that's going on there. I tested on R 3.3.3 and `purrr` 0.2.2.2 . – GGamba Jun 06 '17 at 16:50
  • It is loaded and works. I am on the same R version as well. – guy Jun 06 '17 at 16:55
  • Then the `df` object is different, and seeing as the other answers don't work either, I suggest you add the result of `dput(df)` to your question – GGamba Jun 06 '17 at 17:04
  • I think the issue is that my X1 column is a factor with over 1000 levels, even though I am accessing a really small subset of my data. I think I need to unfactorize it or something. – guy Jun 06 '17 at 17:19
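If unused factor levels are indeed involved (this is only the guess from the last comment, not a confirmed cause of the error), one option is to drop them when splitting. The sketch below uses a made-up five-row `df` whose `X1` factor carries 1000 levels; `groups` is a hypothetical name.

```r
# Hypothetical setup: X1 is a factor that still carries levels
# from a much larger data set
df <- data.frame(
  X1 = factor(c(1, 1, 1, 2, 2), levels = 1:1000),
  A  = c(12, 18, 18, 73, 74),
  B  = c(0, 0, 1, -2, -0.5)
)

# By default split() returns one element per factor level,
# including empty data frames for levels that never occur
length(split(df, df$X1))   # 1000

# drop = TRUE keeps only the levels that are actually present
groups <- split(df[, c("A", "B")], df$X1, drop = TRUE)
distList <- lapply(groups, dist)

names(distList)            # "1" "2"
```

Calling `droplevels(df)` before splitting should have the same effect on the grouping.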

Here's my code and the solution

require(dplyr)
df2 <- structure(list(X1 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L), A = c(12L, 18L, 18L, 13L, 18L, 18L, 18L, 
18L, 18L, 73L, 73L, 74L, 73L, 71L, 75L), B = c(0, 0, 1, 0, 4, 
0, 5, 0, 0, -2, -0.5, -0.5, 0, -1, 0)), .Names = c("X1", "A", 
"B"), class = "data.frame", row.names = c("1", "2", "3", "4", 
"5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15"))
mydf <- df2 %>% group_by(X1) %>% summarise(distmatrix=list(dist(cbind(A,B))))
mydf
# # A tibble: 2 × 2
#      X1 distmatrix
#   <int> <list>
# 1     1 <S3: dist>
# 2     2 <S3: dist>
mydf$distmatrix
# [[1]]
#          1        2        3        4        5        6        7        8
# 2 6.000000                                                               
# 3 6.082763 1.000000                                                      
# 4 1.000000 5.000000 5.099020                                             
# 5 7.211103 4.000000 3.000000 6.403124                                    
# 6 6.000000 0.000000 1.000000 5.000000 4.000000                           
# 7 7.810250 5.000000 4.000000 7.071068 1.000000 5.000000                  
# 8 6.000000 0.000000 1.000000 5.000000 4.000000 0.000000 5.000000         
# 9 6.000000 0.000000 1.000000 5.000000 4.000000 0.000000 5.000000 0.000000
# 
# [[2]]
#          1        2        3        4        5
# 2 1.500000                                    
# 3 1.802776 1.000000                           
# 4 2.000000 0.500000 1.118034                  
# 5 2.236068 2.061553 3.041381 2.236068         
# 6 2.828427 2.061553 1.118034 2.000000 4.123106
A Gore
  • @tbone I have attached the solution; you should get the same result. My question is then: is this the output you want, and does it match the output you are getting? – A Gore Jun 06 '17 at 17:02
  • While I think this uses the right logic, you could do it without `dplyr`: `out <- by(df2[, -1L], df2$X1, dist, simplify = FALSE)`. Then `out[[1L]]` would have the lower triangular of the first group and so on. – Alexis Jun 17 '18 at 14:10
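The `by()` call from the last comment can be checked against a manual subset to confirm that the distances really are computed per group. This is a small sketch using the data from the question, rebuilt with `data.frame()` so it runs on its own; `manual` is a name introduced only for the check.

```r
# Data from the question
df2 <- data.frame(
  X1 = rep(1:2, times = c(9, 6)),
  A  = c(12, 18, 18, 13, 18, 18, 18, 18, 18, 73, 73, 74, 73, 71, 75),
  B  = c(0, 0, 1, 0, 4, 0, 5, 0, 0, -2, -0.5, -0.5, 0, -1, 0)
)

# Apply dist() to columns A and B separately for each value of X1
out <- by(df2[, -1L], df2$X1, dist, simplify = FALSE)

# Compare the second group's result with dist() on a manual subset
manual <- dist(df2[df2$X1 == 2, c("A", "B")])
all.equal(as.vector(out[[2L]]), as.vector(manual))  # TRUE

# Number of pairwise distances per group: 9*8/2 and 6*5/2
length(out[[1L]])  # 36
length(out[[2L]])  # 15
```

If the two vectors agree, the grouping worked, and any remaining mismatch is more likely to come from the input data than from the grouping step.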