I am trying to calculate the total distance traveled by a specific person, but I'm not sure how to specify this for the dist() function so that I get each individual's distance, and not everyone's distances summed together (e.g. John + James + Bob + ...). The data looks something like this (but a lot bigger):

Name    x    y
John    12  34
John    15  31
John    8   38
John    20  14
John    12  35
Bob     2   15
Bob     2   18
James   30  21
James   30  28
James   29  32
...

My current code is:

dist(rbind(data$x,data$y), method = "euclidean")

I've tried putting an if(data$name == "John") condition everywhere possible, with {} and whatnot, but every attempt gives me an error. Can anyone help me, please?

Alexis
Robo
  • Why don't you share part of your data along with your question? – MKR Jun 15 '18 at 14:42
  • Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). This will make it much easier for others to help you. – Jaap Jun 15 '18 at 14:48
  • I'll add it ASAP. It's my first time using this site so might take a while – Robo Jun 15 '18 at 14:55
  • Without seeing your data it is a bit tough to tell, but I would guess you need to subset your data frame to have just John's data, for example `data[data$name == "John", ]`, and then use `dist`. – Ian Wesley Jun 15 '18 at 14:55

2 Answers

Using the dplyr package, you can apply the dist function over each subset of the name variable. The solution is based on the answer found here.

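library(dplyr)
# replace = TRUE is required to draw 15 values from 1:10
data = data.frame(name = c(rep('John',5), rep('Steve', 5), rep('Dave', 5)), x=sample(1:10,15, replace = TRUE), y=sample(1:10,15, replace = TRUE))
# rbind(x, y) hands dist() two rows, so this gives one value per name:
# the distance between the vector of x values and the vector of y values
distout = data %>% group_by(name) %>% summarise(distmatrix=as.numeric(dist(rbind(x, y), method = "euclidean")))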
Jacob F
  • Thank you for the reply! I'll try this as well as subsetting my data as Ian suggested. I will need some time to research and fully understand the code you have suggested though, because I've just started using R. But I'll definitely give it a go. – Robo Jun 15 '18 at 15:15

If you're calculating distance travelled, then I think you need the distance between contiguous coordinates. You can use the dist function provided by the proxy package, which is a bit more flexible than the default one, and combine it with dplyr:

library(proxy)
library(dplyr)

df <- data.frame(Name = c(rep("John", 5L), rep("Steve", 5L), rep("Dave", 5L)), 
                 x = sample(1:30, 15L),
                 y = sample(1:30, 15L))

group_fun <- function(sub_df) {
    if (nrow(sub_df) == 1L)
        return(data.frame(Name = sub_df$Name, total = 0))

    x <- sub_df[-nrow(sub_df), c("x", "y")]
    y <- sub_df[-1L, c("x", "y")]
    total <- sum(proxy::dist(x, y, method = "Euclidean", pairwise = TRUE))
    # return
    data.frame(Name = sub_df$Name[1L], total = total)
}

out <- df %>%
    group_by(Name) %>%
    do(group_fun(.))

Inside group_fun, x contains all coordinates except the last one, and y contains all coordinates except the first one (per group in both cases), so x[i,] and y[i,] are consecutive coordinates for any i. Therefore, when we call proxy::dist with pairwise = TRUE, we get the distance between each such pair (row-wise), and their sum is the total distance travelled.

In the returned data frame we use sub_df$Name[1L] because Name was a grouping variable, so it must be the same for all rows in sub_df, and we only want one of its values in the summary.
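
To see the pairing on a small case, here is a minimal base-R sketch using John's five rows from the question. It assumes the rows are already in travel order; the names john, from, to, steps and total are only illustrative and are not part of the code above.

```r
# John's rows from the question, assumed to be in travel order
john <- data.frame(x = c(12, 15, 8, 20, 12),
                   y = c(34, 31, 38, 14, 35))

from <- john[-nrow(john), ]  # every point except the last
to   <- john[-1L, ]          # every point except the first

# Euclidean length of each step from[i, ] -> to[i, ]
steps <- sqrt((to$x - from$x)^2 + (to$y - from$y)^2)

# Total distance travelled is the sum of the step lengths
total <- sum(steps)
```

Five points give four steps, which is why both from and to have one row fewer than the original data.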

And if you want to be a bit more compact you can do it without dist (i.e. only with dplyr):

out <- df %>%
    group_by(Name) %>%
    summarise(total = sum(sqrt((x - lag(x))^2 + (y - lag(y))^2), na.rm = TRUE))
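
If you would rather avoid extra packages, the same per-person totals can probably be sketched in base R with split and diff. This assumes the rows of each person are already in travel order; path_length is a hypothetical helper name, and the data are the sample rows shown in the question.

```r
# Sample rows from the question
df <- data.frame(
    Name = c(rep("John", 5L), rep("Bob", 2L), rep("James", 3L)),
    x = c(12, 15, 8, 20, 12, 2, 2, 30, 30, 29),
    y = c(34, 31, 38, 14, 35, 15, 18, 21, 28, 32)
)

# Hypothetical helper: sum of the distances between consecutive rows.
# diff() returns an empty vector for a single row, so the total is 0.
path_length <- function(sub_df) {
    sum(sqrt(diff(sub_df$x)^2 + diff(sub_df$y)^2))
}

# One total per person, returned as a named numeric vector
totals <- sapply(split(df, df$Name), path_length)
```

Here totals["Bob"] is 3, since Bob only moves from (2, 15) to (2, 18).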
Alexis