
I would like to do a pairwise comparison of every row within a group, for the difference in a date.time variable.

I have a data frame composed of a "site" variable, a dummy "species" variable, and a POSIXct "date.time" variable, in ascending order. Each row has a different species, as I'm interested in the time difference between different species visiting a site. A sample (3 columns, 50 rows) is below:

structure(list(site = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L), .Label = c("act_0041", 
"act_0048", "ACT0009", "ACT0035", "ACT0041"), class = "factor"), 
    species = c(12, 14, 28, 6, 34, 29, 27, 22, 35, 9, 16, 2, 
    32, 33, 6, 2, 29, 10, 34, 9, 22, 28, 32, 23, 33, 6, 10, 27, 
    12, 34, 32, 31, 10, 30, 6, 14, 35, 8, 23, 32, 12, 34, 22, 
    1, 13, 18, 6, 34, 27, 11), date.time = structure(c(1531454862, 
    1535035906, 1535348634, 1536254587, 1537580136, 1539047529, 
    1539335947, 1542708373, 1545597646, 1548570870, 1548862522, 
    1548970932, 1548970934, 1530624228, 1536088381, 1536270537, 
    1538374649, 1538705865, 1543254377, 1544755701, 1545263758, 
    1546425304, 1546490305, 1530393638, 1531013434, 1532049165, 
    1537459670, 1545803958, 1546142278, 1560118590, 1560203862, 
    1530431347, 1531031939, 1533129189, 1533975327, 1534157098, 
    1535229634, 1535594837, 1536352632, 1536355007, 1536397768, 
    1536707407, 1537231673, 1562873882, 1531454862, 1531595641, 
    1536254587, 1537732697, 1538760001, 1540317399), class = c("POSIXct", 
    "POSIXt"), tzone = "")), row.names = c(NA, -50L), class = c("tbl_df", 
"tbl", "data.frame"))

I want to compare all time differences between rows representing different species within a group, and create a data frame or matrix with the group (e.g. "site"), each pairwise species combination, and the time difference between them. E.g., for site a, I would have the time difference between species a and species b, species b and c, species a and c, etc. If I can avoid duplicated interactions (e.g. site a, species b and c vs. site a, species c and b) that would be great. The desired output looks like this:

structure(list(site = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "act_0041", class = "factor"), 
    species_interaction = structure(c(1L, 2L, 7L, 12L, 10L, 8L, 
    6L, 5L, 11L, 13L, 3L, 4L, 9L, 14L, 15L, 17L, 16L), .Label = c("12,12", 
    "12,14", "12,16", "12,2", "12,22", "12,27", "12,28", "12,29", 
    "12,32", "12,34", "12,35", "12,6", "12,9", "28,14", "28,28", 
    "28,34", "28,6"), class = "factor"), time.diff..mins. = c(NA, 
    350L, 6502L, 150L, 2065L, 52L, 630L, 542L, 2584L, 241L, 340L, 
    3689L, 201L, 31L, NA, 28L, 356L)), class = "data.frame", row.names = c(NA, -17L))

I've been trying with combn and apply, but I am only getting a matrix of empty integers and the values 1, 2, 3, etc. I know I'm missing something to do with the date.time aspect, and maybe with how the result is assembled into a data frame.

df %>% group_by(site)
comb <- t(combn(nrow(as.data.frame(df$species)),2))
dx <- apply(comb, 1, function(x) df[x[1], -1] - df[x[2], -1])
dt <- cbind(comb, dx)

I have been trying to adapt this example, though I think I'm missing a simple piece. Any help would be great.

Paul B

1 Answer


One error in your code is that combn should be called on the species column itself, not on nrow() of it: nrow() returns a single integer n, which combn treats as the sequence 1:n, so you get pairs of row indices (1, 2, 3, ...) rather than pairs of species.
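To see the difference, compare combn on a single number with combn on a vector:

```r
# combn(n, 2) treats a single number n as the sequence 1:n,
# so it returns pairs of *indices*:
combn(3, 2)
#      [,1] [,2] [,3]
# [1,]    1    1    2
# [2,]    2    3    3

# combn on the vector itself returns pairs of its *values*:
combn(c(12, 14, 28), 2)
#      [,1] [,2] [,3]
# [1,]   12   12   14
# [2,]   14   28   28
```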

However, I don't think just fixing that will give you the result you want, so here is an example using functions from the dplyr package.

First, I created some fake data matching yours to use (loading dplyr up front, since the code below relies on it):

library(dplyr)

fake.df <- data.frame(site = rep(LETTERS[1:5], each = 20),
                      species = sample(1:10, 100, replace = TRUE),
                      date.time = as.POSIXct(1531454861 + seq(500, 50000, by = 500),
                                             tz = "", origin = "1960-01-01"))

For ease of reading, I'm just going to use a for-loop that iterates over each of your sites, identifies species interaction pairs within that site, and calculates the shortest time difference between visits. I say the shortest because you mentioned that you didn't want duplicates, but this fake data will sometimes have a species making multiple visits to a site, so we take the shortest.

#list object to save iteration by site into
new.dfs <- list()
for(i in 1:length(unique(fake.df$site))){
  this.site <- unique(fake.df$site)[i]
  this.visit <- fake.df[fake.df$site == this.site,]

  #get unique visits by species
  this.pairs <- unique(t(combn(this.visit$species,2)))

  #combine this with the time data for each of these
  this.pair.times <- data.frame(this.pairs) %>%
    rename(species_1 = X1, species_2 = X2) %>%
    #time of first visit
    left_join(this.visit, by = c("species_1" = "species")) %>%
    rename(time_1 = date.time) %>%
    #time of second visit
    left_join(this.visit, by = c("species_2" = "species")) %>%
    rename(time_2 = date.time) %>%
    #calculate time difference
    mutate(time_diff = abs(time_1-time_2)) %>%
    #create ID column of species pairs
    rowwise() %>%
    mutate(species_pair = paste(min(species_1, species_2), max(species_1, species_2), sep = "_")) %>%
    #drop duplicates, keep first visit only (smallest time difference)
    group_by(species_pair) %>%
    arrange(time_diff) %>%
    slice(1) %>%
    #select relevant columns
    select(site = site.x, species_pair, time_diff)

  new.dfs[[i]] <- this.pair.times
}

#combine into one large data.frame
result.df <- bind_rows(new.dfs)

This results in the following dataset:

site  species_pair time_diff
  <fct> <chr>        <drtn>   
1 A     1_1             0 secs
2 A     1_10         1000 secs
3 A     1_2          2000 secs
4 A     1_3           500 secs
5 A     1_4          2000 secs
6 A     1_5           500 secs
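
As an aside, the same per-site steps (build unique species pairs, join the visit times, keep the smallest difference per pair) can also be written without an explicit loop by splitting on site and binding the pieces. This is just a sketch of an alternative, assuming the same fake.df and column names as above:

```r
library(dplyr)

# Per-site helper: enumerate unique species pairs, attach each species'
# visit times, and keep the smallest time difference per unordered pair.
pair_times <- function(this.visit) {
  data.frame(unique(t(combn(this.visit$species, 2)))) %>%
    rename(species_1 = X1, species_2 = X2) %>%
    left_join(this.visit, by = c("species_1" = "species")) %>%
    rename(time_1 = date.time) %>%
    left_join(this.visit, by = c("species_2" = "species")) %>%
    rename(time_2 = date.time) %>%
    mutate(time_diff = abs(time_1 - time_2)) %>%
    rowwise() %>%
    mutate(species_pair = paste(min(species_1, species_2),
                                max(species_1, species_2), sep = "_")) %>%
    group_by(species_pair) %>%
    arrange(time_diff) %>%
    slice(1) %>%
    select(site = site.x, species_pair, time_diff)
}

# split() gives one data frame per site; lapply() applies the helper to each.
result.df <- bind_rows(lapply(split(fake.df, fake.df$site), pair_times))
```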

Edit [2020-04-30]: keep duplicates

If you want to keep the duplicates and also have the species_pair arranged in the order of arrival (e.g. if species 5 arrived before species 1 it would read 5_1):

#list object to save iteration by site into
new.dfs <- list()
for(i in 1:length(unique(fake.df$site))){
  this.site <- unique(fake.df$site)[i]
  this.visit <- fake.df[fake.df$site == this.site,]

  #get unique visits by species
  this.pairs <- unique(t(combn(this.visit$species,2)))

  #combine this with the time data for each of these
  this.pair.times <- data.frame(this.pairs) %>%
    rename(species_1 = X1, species_2 = X2) %>%
    #time of first visit
    left_join(this.visit, by = c("species_1" = "species")) %>%
    rename(time_1 = date.time) %>%
    #time of second visit
    left_join(this.visit, by = c("species_2" = "species")) %>%
    rename(time_2 = date.time) %>%
    #calculate time difference
    mutate(time_diff = abs(time_1-time_2)) %>%
    #create ID column of species pairs, this time sensitive to order of arrival
    rowwise() %>%
    mutate(species_pair = paste(species_1, species_2, sep = "_")) %>%
    #arrange in order of arrival
    arrange(time_1, time_2) %>%
    #select relevant columns
    select(site = site.x, species_pair, time_1, time_2, time_diff)

  new.dfs[[i]] <- this.pair.times
}

#combine into one large data.frame
result.df <- bind_rows(new.dfs)

This would then be ordered first by the time of visit of the first species in the pair (time_1) and then by the time of visit of the second species in the pair (time_2).

> result.df[100:115,]
Source: local data frame [16 x 5]
Groups: <by row>

# A tibble: 16 x 5
   site  species_pair time_1              time_2              time_diff
   <fct> <chr>        <dttm>              <dttm>              <drtn>   
 1 A     8_7          2008-07-12 00:57:41 2008-07-12 01:39:21 2500 secs
 2 A     8_3          2008-07-12 00:57:41 2008-07-12 01:47:41 3000 secs
 3 A     8_8          2008-07-12 00:57:41 2008-07-12 01:56:01 3500 secs
 4 A     8_4          2008-07-12 00:57:41 2008-07-12 02:04:21 4000 secs
 5 A     8_7          2008-07-12 00:57:41 2008-07-12 02:12:41 4500 secs
 6 A     8_8          2008-07-12 00:57:41 2008-07-12 02:21:01 5000 secs
 7 A     8_9          2008-07-12 00:57:41 2008-07-12 02:29:21 5500 secs
 8 A     8_8          2008-07-12 00:57:41 2008-07-12 02:37:41 6000 secs
 9 A     8_8          2008-07-12 00:57:41 2008-07-12 02:46:01 6500 secs
10 A     8_2          2008-07-12 00:57:41 2008-07-12 02:54:21 7000 secs
11 A     6_3          2008-07-12 01:06:01 2008-07-12 00:49:21 1000 secs
12 A     6_8          2008-07-12 01:06:01 2008-07-12 00:57:41  500 secs
13 A     6_6          2008-07-12 01:06:01 2008-07-12 01:06:01    0 secs
14 A     6_4          2008-07-12 01:06:01 2008-07-12 01:14:21  500 secs
15 A     6_6          2008-07-12 01:06:01 2008-07-12 01:22:41 1000 secs
16 A     6_9          2008-07-12 01:06:01 2008-07-12 01:31:01 1500 secs

You may notice that sometimes time_2 is less than time_1, because we haven't filtered for that. If you only want visits that happen after each species 1, you can filter with filter(result.df, time_2 > time_1).

m.evans
  • thanks for this! I wasn't sure how to approach with a for-loop so your explanation was great, and this worked a charm. One question I have is, does this keep the order of each interaction? Each interaction could be conditional on the series of site visits previously so eventually I want to look at each in order, with the ability to keep the duplicate interactions. For example, if a series of species visits to a site was, in order, 1,4,1,1,7,4, if I just removed the line that drops duplicates could I still pull out **in order** 1-4, 1-1, 1-1, 1-7, 1-4, then 4-1,4-1, 4-7, 4-4? – Paul B Apr 30 '20 at 17:04
  • For that you can drop the whole three lines under the drop duplicates comment, and then if you want them in order of visit, you can arrange them by time_1 and time_2. I will edit the above to include an example of this that also keeps the species_pairs in visit order, since the above code always places the lower numbered species first – m.evans Apr 30 '20 at 20:43
  • this is brilliant thanks for putting this together - super clear answer and a huge help! I had started an inelegant solution with expand.grid but couldn't maintain the ordering. – Paul B May 01 '20 at 23:43
  • glad it was helpful! – m.evans May 03 '20 at 19:09