
I have been unable to find a solution to my query on Stack Overflow. This post is similar, but my dataset is slightly (and importantly) different, in that I have multiple measures of 'time' within my grouping variable.

I have observations of organisms at various sites, over time. The sites are further aggregated into larger areas, so I want to eventually have a function I can call in ddply to summarize the dataset for each of the time periods within the geographical areas. However, I'm having trouble getting the function I need.

Question

How do I cycle through time periods and compare with the previous time period, calculating the intersection (i.e. number of 'sites' occurring in both time periods) and the sum of the number occurring in each period?

Toy dataset:

time = c(1,1,1,1,2,2,2,3,3,3,3,3)
site = c("A","B","C","D","A","B","C","A","B","C","D","E")
df <- data.frame(time, site)  # data.frame() keeps time numeric; cbind() would coerce everything to character
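For concreteness, here is what the two quantities mean for a single pair of consecutive periods, computed directly with base R's intersect (a minimal self-contained sketch of the toy data):

```r
time <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3)
site <- c("A", "B", "C", "D", "A", "B", "C", "A", "B", "C", "D", "E")

s1 <- unique(site[time == 1])  # "A" "B" "C" "D"
s2 <- unique(site[time == 2])  # "A" "B" "C"

length(intersect(s2, s1))  # intersection: 3 sites occur in both periods
length(s1) + length(s2)    # both: 4 + 3 = 7
```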

My function

dist2 <- function(df){
  for(i in unique(df$time)){
    intersection <- length(which(df[df$time == i, "site"] %in% df[df$time == i - 1, "site"]))
    both <- length(unique(df[df$time == i, "site"])) + length(unique(df[df$time == i - 1, "site"]))
  }
  return(as.data.frame(cbind(time, intersection, both)))
}

dist2(df)

What I get:

dist2(df)
   time intersection both
1     1            3    8
2     1            3    8
3     1            3    8
4     1            3    8
5     2            3    8
6     2            3    8
7     2            3    8
8     3            3    8
9     3            3    8
10    3            3    8
11    3            3    8
12    3            3    8

What I expect (hoped!) to achieve:

time intersection both
1    1           NA    4
2    2            3    7
3    3            3    8

Once I have a working function, I want to use it with ddply on the whole dataset to calculate these values for each area.
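That eventual per-area call can be sketched as follows. The area column and its toy values here are hypothetical, and base R's split/lapply stands in for plyr's ddply so the example is self-contained; the inner function is one possible working version of dist2:

```r
# Hypothetical dataset with an 'area' grouping column
df2 <- data.frame(
  area = rep(c("North", "South"), each = 4),
  time = c(1, 1, 2, 2, 1, 1, 2, 2),
  site = c("A", "B", "A", "C", "X", "Y", "Y", "Z")
)

# One possible working per-period summary
dist2 <- function(df) {
  Un1 <- unique(df$time)
  intersection <- both <- numeric(length(Un1))
  for (i in seq_along(Un1)) {
    cur  <- unique(df$site[df$time == Un1[i]])
    prev <- unique(df$site[df$time == Un1[i - 1]])  # empty when i == 1
    intersection[i] <- length(intersect(cur, prev))
    both[i] <- length(cur) + length(prev)
  }
  data.frame(time = Un1, intersection, both)
}

# Base-R equivalent of ddply(df2, ~area, dist2): one summary per area
out <- do.call(rbind, lapply(split(df2, df2$area), dist2))
out
```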

Many thanks for any pointers, tips, advice!

I am running:

R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

3 Answers


You can determine the number of times each site appeared at each time with the table function:

(tab <- table(df$time, df$site))
#     A B C D E
#   1 1 1 1 1 0
#   2 1 1 1 0 0
#   3 1 1 1 1 1

With some simple manipulation, you can build a same-sized table that contains the number of times each site appeared in the previous time period:

(prev.tab <- head(rbind(NA, tab), -1))
#    A  B  C  D  E
#   NA NA NA NA NA
# 1  1  1  1  1  0
# 2  1  1  1  0  0

Determining the number of sites in common with the previous iteration or the number of unique sites in the previous iteration plus the number of unique sites in the current iteration are now simple vectorized operations:

data.frame(time=unique(df$time),
           intersection=rowSums(tab * (prev.tab >= 1)),
           both=rowSums(tab >= 1) + rowSums(prev.tab >= 1, na.rm=TRUE))
#   time intersection both
# 1    1           NA    4
# 2    2            3    7
# 3    3            3    8

Because this doesn't involve making a bunch of intersection or unique calls involving pairs of time values it should be more efficient than looping solutions:

# Slightly larger dataset with 100000 observations
set.seed(144)
df <- data.frame(time=sample(1:50, 100000, replace=TRUE),
                 site=sample(letters, 100000, replace=TRUE))
df <- df[order(df$time),]
josilber <- function(df) {
  tab <- table(df$time, df$site)
  prev.tab <- head(rbind(NA, tab), -1)
  data.frame(time=unique(df$time),
             intersection=rowSums(tab * (prev.tab >= 1)),
             both=rowSums(tab >= 1) + rowSums(prev.tab >= 1, na.rm=TRUE))
}
# dist2 from @akrun's solution
library(microbenchmark)
microbenchmark(josilber(df), dist2(df))
# Unit: milliseconds
#          expr       min        lq      mean    median         uq       max neval
#  josilber(df)  28.74353  32.78146  52.73928  40.89203   62.04933  237.7774   100
#     dist2(df) 540.78422 574.28319 829.04174 825.99418 1018.76561 1607.9460   100
  • Good use of table, really fast code. Did above benchmark on my solution and it was slightly over 10 times slower than yours, mainly due to `rbind/make.unique` – Pafnucy May 25 '15 at 20:18

You can modify the function to preallocate the result vectors and index them by position, so each iteration's values are kept instead of overwritten:

dist2 <- function(df){
  Un1 <- unique(df$time)
  intersection <- numeric(length(Un1))
  both <- numeric(length(Un1))
  for(i in seq_along(Un1)){
    intersection[i] <- length(which(df[df$time == Un1[i], "site"] %in%
                                    df[df$time == Un1[i-1], "site"]))
    both[i] <- length(unique(df[df$time == Un1[i], "site"])) +
               length(unique(df[df$time == Un1[i-1], "site"]))
  }
  return(data.frame(time = Un1, intersection, both))
}

dist2(df)
#   time intersection both
# 1    1            0    4
# 2    2            3    7
# 3    3            3    8

Here is my memory-intensive proposal: duplicate the data shifted forward by one time step, so each period's rows sit alongside the previous period's rows, then count the duplicated sites within each group.

library(plyr)
df <- rbind(df, within(df, {time = time + 1}))
res <- ddply(df, ~time, summarize, intersect = sum(duplicated(site)), both = length(site))
res <- res[-nrow(res), ]
res

Output:

  time intersect both
1    1         0    4
2    2         3    7
3    3         3    8

Change 0 to NA and you're done.
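That substitution is a one-liner; the res data frame is reconstructed here so the snippet stands alone:

```r
# Result of the ddply approach above, reconstructed for a self-contained example
res <- data.frame(time = 1:3, intersect = c(0, 3, 3), both = c(4, 7, 8))

# The first period has no predecessor, so its intersect count is meaningless
res$intersect[res$time == min(res$time)] <- NA
res
```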
