0

I have a dataset which simplified look something like this:

YEAR = c(2009,2009,2009,2009,2009,2009,2009,2010,2010,2010,2010,2010,2010,2010)
FROM = c("A","C","B","D","B","A","C","A","C","B","A","D","B","A")
TO = c("B","D","C","A","D","C","B","B","A","D","D","C","A","D")
DATA = data.frame(YEAR,FROM,TO)

YEAR    FROM    TO
2009    A       B
2009    C       D
2009    B       C
2009    D       A
2009    B       D
2009    A       C
2009    C       B
2010    A       B
2010    C       A
2010    B       D
2010    A       D
2010    D       C
2010    B       A
2010    A       D

What I want is two additional columns, let's say OCC_FROM and OCC_TO, that is the cumulative count of occurrences in both FROM and TO columns in prior rows, by YEAR. Like this:

YEAR    FROM    TO  OCC_FROM    OCC_TO
2009    A       B   0           0
2009    C       D   0           0
2009    B       C   1           1
2009    C       A   2           1
2009    B       D   2           1
2009    A       C   2           3
2009    C       B   4           3
2010    A       B   0           0
2010    C       A   0           1
2010    B       D   1           0
2010    A       B   2           2
2010    D       C   1           1
2010    B       A   3           3
2010    A       D   4           2

What I've managed to produce, with the help of Cumulative count in R, is this, which is obviously not quite what I want since it doesn't take YEAR into consideration:

DATA$OCC_FROM = sapply(1:length(DATA$FROM),function(i)sum(DATA$FROM[i]==DATA$FROM[1:i]))+sapply(1:length(DATA$FROM),function(i)sum(DATA$FROM[i]==DATA$TO[1:i]))-1
DATA$OCC_TO = sapply(1:length(DATA$TO),function(i)sum(DATA$TO[i]==DATA$FROM[1:i]))+sapply(1:length(DATA$TO),function(i)sum(DATA$TO[i]==DATA$TO[1:i]))-1

YEAR    FROM    TO  OCC_FROM    OCC_TO
2009    A       B   0           0
2009    C       D   0           0
2009    B       C   1           1
2009    C       A   2           1
2009    B       D   2           1
2009    A       C   2           3
2009    C       B   4           3
2010    A       B   3           4
2010    C       A   5           4
2010    B       D   5           2
2010    A       B   5           6
2010    D       C   3           6
2010    B       A   7           6
2010    A       D   7           4

Edit: I also want to be able to sum two columns cumulatively based on FROM and TO, by YEAR as before. For simplicity I'll use OCC_FROM and OCC_TO. Like this:

YEAR    FROM    TO  OCC_FROM    OCC_TO    TOTAL_FROM    TOTAL_TO
2009    A       B   0           0         0             0
2009    C       D   0           0         0             0
2009    B       C   1           1         0             0
2009    C       A   2           1         1             0
2009    B       D   2           1         1             0
2009    A       C   2           3         1             3
2009    C       B   4           3         6             3
2010    A       B   0           0         0             0
2010    C       A   0           1         0             0
2010    B       D   1           0         0             0
2010    A       B   2           2         1             1
2010    D       C   1           1         0             0
2010    B       A   3           3         3             3
2010    A       D   4           2         6             1
Community
  • 1
  • 1
Alex T
  • 343
  • 1
  • 4
  • 9

2 Answers2

1

You could try

prevCount <- function(x) {
   eq <- outer(x,x,"==")
   eq <- eq & upper.tri(eq)
   eqInt <- ifelse(eq, 1, 0)
   return(apply(eqInt,2,sum))
}
DATA$OCC_FROM <- ave(DATA$FROM, DATA$YEAR, FUN=prevCount )

prevCount is a function that, within a year, returns the number of elements prior to each element that are identical. The ave call then applies this per year.

Rolling up the corrections in the comment, we get

ord <- order(c(1:nrow(DATA), 1:nrow(data)))
targets <- c(data$FROM, data$TO)[ord]
yr <- c(data$YEAR, data$YEAR)[ord]
res <- ave(targets, yr, FUN=prevCount)
data$occ_from <- res[seq(1, length(res), 2)]
data$occ_to <- res[seq(2, length(res), 2)]

Also, the prevCount function can be simplified as:

prevCount <- function(x) {ave(x==x, x, FUN=cumsum)}
Gavin Kelly
  • 2,374
  • 1
  • 10
  • 13
  • Produces the error: "Error in Math.factor(X[[1L]], ...) : cumsum not meaningful for factors" – Alex T Apr 03 '14 at 16:31
  • 1
    Sorry - misread the question. I've now edited it to change the factors to integer counts of previous instances. It doesn't, however, take both target columns into account. I need to think about how that should be done – Gavin Kelly Apr 03 '14 at 16:32
  • 1
    To deal with the two target columns (TO and FROM), I'd actually merge them. `ord <- order(c(1:nrow(DATA), 1:nrow(DATA))); targets <- c(data$FROM, data$TO)[ord]; yr <- c(data$YEAR, data$YEAR)[ord]; res <- ave(targets, yr, FUN=prevCOUNT)` and then `data$occ_from <- res[seq(along=res, by=2)]; data$occ_to <- res[seq(along=res, by=2)+1]` – Gavin Kelly Apr 03 '14 at 16:39
  • Another error: "Error in `$<-.data.frame`(`*tmp*`, "OCC_FROM", value = c(0, 0, 1, 1, 2, : replacement has 28 rows, data has 14" – Alex T Apr 03 '14 at 16:51
  • 1
    change the first `seq` to `seq(1, length(res),2)` and the second to `seq(2,length(res), 2)` – Gavin Kelly Apr 04 '14 at 10:00
  • Thanks alot! Let's say I now want to tweak the solution into summing two new columns of integers, named SENT and RECEIVED, based on FROM, TO and YEAR How to? – Alex T Apr 04 '14 at 11:35
  • 1
    If `with(DATA, aggregate(SENT, list(FROM, TO, YEAR), sum))` isn't what you want, then you'll need to give an example – Gavin Kelly Apr 04 '14 at 12:03
  • I realize my follow up questions is hard to understand - it's also hard for me to explain it. I've updated the question and would be very grateful if you could solve it for me. – Alex T Apr 04 '14 at 12:37
  • Could you explain why you get the value `6` in row 7 - that might help show what you're trying to achieve. – Gavin Kelly Apr 04 '14 at 14:08
  • The 6 is the sum of the previous values from both OCC_FROM and OCC_TO, that is associated with "C". In this case, 0+1+2+3=6. Thanks for your patience. – Alex T Apr 04 '14 at 14:45
1
# Your data - this is not the data at the top of your question but from your 
# solution  [the 'from' and 'to' don't correspond exactly between your question 
# and solution]

YEAR <- c(2009,2009,2009,2009,2009,2009,2009,2010,2010,2010,2010,2010,2010,2010)
FROM <- c("A","C","B","C","B","A","C","A","C","B","A","D","B","A")
TO <- c( "B","D","C","A","D","C","B","B","A","D","B","C","A","D")
mydf <- data.frame(YEAR,FROM,TO)
names(mydf) <- tolower(names(mydf))

#---------------------------------------------------
# Function to get cumulative sum across columns by group
f <- function(from , to){
        # combine the columns 'from' and 'to' alternately
        l <- c(rbind(from , to))

        # Get and sum duplicate values
        dup <- duplicated(l)
        sums <- ave(dup , l , FUN= cumsum)

        # Reshape data & output
        out <- t(matrix(sums ,2))
        colnames(out) <- c("occ_from","occ_to")
        out
      }

# Not considering year 
f(mydf$from , mydf$to)  
# (data.frame(mydf , f(mydf$from , mydf$to) )

# Calculate by year
s <- split(mydf , mydf$year)
d <- do.call(rbind,lapply(s,function(i) f(i[,"from"],i[,"to"])))
(mydf <- data.frame(mydf , d , row.names=NULL))
user20650
  • 24,654
  • 5
  • 56
  • 91