I have a data frame which contains an event history and I want to check its integrity by checking whether the last event for each ID number matches the current value in the system for that ID number. The data are coded as factors. The following toy data frame is a minimal example:
df <-data.frame(ID=c(1,1,1,1,2,2,2,3,3),
current.grade=as.factor(c("Senior","Senior","Senior","Senior",
"Junior","Junior","Junior",
"Sophomore","Sophomore")),
grade.history=as.factor(c("Freshman","Sophomore","Junior","Senior",
"Freshman","Sophomore","Junior",
"Freshman","Sophomore")))
which gives output
> df
ID current.grade grade.history
1 1 Senior Freshman
2 1 Senior Sophomore
3 1 Senior Junior
4 1 Senior Senior
5 2 Junior Freshman
6 2 Junior Sophomore
7 2 Junior Junior
8 3 Sophomore Freshman
9 3 Sophomore Sophomore
> str(df)
'data.frame': 9 obs. of 3 variables:
$ ID : num 1 1 1 1 2 2 2 3 3
$ current.grade: Factor w/ 3 levels "Junior","Senior",..: 2 2 2 2 1 1 1 3 3
$ grade.history: Factor w/ 4 levels "Freshman","Junior",..: 1 4 2 3 1 4 2 1 4
I want to use dplyr
to extract the last value in grade.history
and check it against current.grade
:
df.summary <- df %>%
group_by(ID) %>%
summarize(current.grade.last=last(current.grade),
grade.history.last=last(grade.history))
However, dplyr
seems to convert the factors to integers, so I get this:
> df.summary
Source: local data frame [3 x 3]
ID current.grade.last grade.history.last
1 1 2 3
2 2 1 2
3 3 3 4
> str(df.summary)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 3 obs. of 3 variables:
$ ID : num 1 2 3
$ current.grade.last: int 2 1 3
$ grade.history.last: int 3 2 4
Note that the values don't line up because the original factors had different level sets. What's the right way to do this with dplyr
?
I'm using R
version 3.1.1 and dplyr
version 0.3.0.2