Count function for combinations of repeated observations and factors in a dataframe within and across time

Question

Suppose I have data of the following type:

df <- data.frame(student = c("S1", "S2", "S3", "S4", "S5", "S2", "S6", "S1", "S7", "S8"), 
              factor = c("A", "A", "A", "A", "A", "B", "B", "C", "C", "D"), 
              year =  c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2), 
              count1 = c(0, 1, 0, 0, 0, 1, 0, 0, 0, 0), 
              count2 = c(1, 0, 0, 0, 0, 0, 0, 1, 0, 0))

I need a more efficient way than the typical apply() functions to analyze the student and factor columns within a given year. When a student maintains the same factor level in a given year, the function returns a count of zero. When a student is in more than one factor level in a given year, the count is incremented by 1 for each instance of the student in a separate factor level.

I would also like a separate count that analyzes students across years. For instance, a student who maintains the same factor level across years receives a count of zero. If a student is found to have separate factor levels in separate years, the count is incremented by 1 for each instance.

There are over 10k observations, and my attempts with the *apply functions have been unproductive. Namely, I have been able to count unique instances of each student and factor, but only the first unique instance, not all unique instances of a student (unique id) and factor. Individuals may be repeated either within or across years.

The ideal output is as follows:

Student1,Factor.Count(Within Year),Factor.Count(Between Year)

It's difficult to understand question without sample data. Please add reproducible sample for good people here to help you. See http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — CHP, Apr 26 '13 at 01:39
Two comments. 1 (somewhat minor): 10K observations is nothing near what you need before `apply` gets too expensive. 2 (somewhat major): it isn't entirely clear what you want. Change your example data so that some student actually gets a score of 0, and give the desired result for the example. — Matthew Lundberg, Apr 26 '13 at 02:19
Please see the additional count columns added to the dataframe/sample code above. — DV Hughes, Apr 26 '13 at 02:31

score 0 · Accepted Answer · answered Apr 26 '13 at 03:00

Here's a chain of commands that gets you there, using factor interactions to find the factor change for a student in the same year:

# Add up the occurrences of a student having multiple factors in the same year,
# for each year
in.each.year <- aggregate(factor~student:year, data=df, FUN=function(x) length(x)-1)[c(1,3)]

# Total these up, for each student
in.year <- aggregate(factor~student, data=in.each.year, FUN=sum)

# The name was "factor".  Set it to the desired name.
names(in.year)[2] <- 'count1'

# Find the occurrences of a student having multiple factors
both <- aggregate(factor~student, data=df, FUN=function(x) length(x)-1)
names(both)[2] <- 'both'

# Combine with 'merge'
m <- merge(in.year, both)

# Subtract to find "count2"
m$count2 <- m$both - m$count1
m$both <- NULL

m
##   student count1 count2
## 1      S1      0      1
## 2      S2      1      0
## 3      S3      0      0
## 4      S4      0      0
## 5      S5      0      0
## 6      S6      0      0
## 7      S7      0      0
## 8      S8      0      0

This can be merged with your original data frame, after dropping its count1 and count2 columns so that the merge is keyed on student only:

merge(df[c("student", "factor", "year")], m)
##    student factor year count1 count2
## 1       S1      A    1      0      1
## 2       S1      C    2      0      1
## 3       S2      A    1      1      0
## 4       S2      B    1      1      0
## 5       S3      A    1      0      0
## 6       S4      A    1      0      0
## 7       S5      A    1      0      0
## 8       S6      B    1      0      0
## 9       S7      C    2      0      0
## 10      S8      D    2      0      0
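One caveat about the chain above: length(x) - 1 counts rows, not distinct factor levels. A student who appears twice with the same level in one year would get count1 = 1, and a student who keeps the same level in two different years would get count2 = 1. If that matters for your real data, here is a minimal sketch of a variant that counts distinct levels instead. The helper extra_levels and the intermediate names (per_year, widest, pooled, res) are my own, and the definition of count2 used here (distinct levels over all years minus the most levels seen in any single year) is only one possible reading of the question, so treat it as an assumption rather than the definitive rule.

```r
# Sample data from the question, without the hand-filled count columns
df <- data.frame(
  student = c("S1", "S2", "S3", "S4", "S5", "S2", "S6", "S1", "S7", "S8"),
  factor  = c("A", "A", "A", "A", "A", "B", "B", "C", "C", "D"),
  year    = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2)
)

# Hypothetical helper: number of distinct values beyond the first one
extra_levels <- function(x) length(unique(x)) - 1

# Extra distinct levels for each student within each year
per_year <- aggregate(factor ~ student + year, data = df, FUN = extra_levels)

# count1: within-year extras, summed over years for each student
count1 <- aggregate(factor ~ student, data = per_year, FUN = sum)
names(count1)[2] <- "count1"

# Largest within-year extra count for each student
widest <- aggregate(factor ~ student, data = per_year, FUN = max)
names(widest)[2] <- "widest"

# Extra distinct levels for each student when all years are pooled
pooled <- aggregate(factor ~ student, data = df, FUN = extra_levels)
names(pooled)[2] <- "pooled"

# Combine on 'student'; count2 is an assumed definition (see text above)
res <- Reduce(merge, list(count1, widest, pooled))
res$count2 <- res$pooled - res$widest
res <- res[c("student", "count1", "count2")]
res
```

On the sample data this should give the same counts as m, since no student repeats a level there. The results only diverge once the data contains repeated student/level rows.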
