Handling factor variables in dplyr

Question

I have a data frame which contains an event history and I want to check its integrity by checking whether the last event for each ID number matches the current value in the system for that ID number. The data are coded as factors. The following toy data frame is a minimal example:

df <-data.frame(ID=c(1,1,1,1,2,2,2,3,3),
                 current.grade=as.factor(c("Senior","Senior","Senior","Senior",
                                         "Junior","Junior","Junior",
                                         "Sophomore","Sophomore")),
                 grade.history=as.factor(c("Freshman","Sophomore","Junior","Senior",
                                   "Freshman","Sophomore","Junior",
                                   "Freshman","Sophomore")))

which gives output

> df
  ID current.grade grade.history
1  1        Senior      Freshman
2  1        Senior     Sophomore
3  1        Senior        Junior
4  1        Senior        Senior
5  2        Junior      Freshman
6  2        Junior     Sophomore
7  2        Junior        Junior
8  3     Sophomore      Freshman
9  3     Sophomore     Sophomore
> str(df)
'data.frame':   9 obs. of  3 variables:
 $ ID           : num  1 1 1 1 2 2 2 3 3
 $ current.grade: Factor w/ 3 levels "Junior","Senior",..: 2 2 2 2 1 1 1 3 3
 $ grade.history: Factor w/ 4 levels "Freshman","Junior",..: 1 4 2 3 1 4 2 1 4

I want to use dplyr to extract the last value in grade.history and check it against current.grade:

df.summary <- df %>%
  group_by(ID) %>%
  summarize(current.grade.last=last(current.grade),
            grade.history.last=last(grade.history))

However, dplyr seems to convert the factors to integers, so I get this:

> df.summary
Source: local data frame [3 x 3]

  ID current.grade.last grade.history.last
1  1                  2                  3
2  2                  1                  2
3  3                  3                  4
> str(df.summary)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   3 obs. of  3 variables:
 $ ID                : num  1 2 3
 $ current.grade.last: int  2 1 3
 $ grade.history.last: int  3 2 4

Note that the values don't line up because the original factors had different level sets. What's the right way to do this with dplyr?

I'm using R version 3.1.1 and dplyr version 0.3.0.2

Add `stringsAsFactors = FALSE` when creating your data set and your code will work fine, see `?data.frame`. See also [here](http://stackoverflow.com/questions/3418128/how-to-convert-a-factor-to-an-integer-numeric-without-a-loss-of-information) for general understanding of factors, or [here](http://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters) — David Arenburg, Jan 10 '15 at 18:07
@David It seems that OP (quite reasonably) *wants* to use factors here. — Konrad Rudolph, Jan 10 '15 at 18:36
@KonradRudolph, that may be also true (I didn't read the background, rather the actual problem they are facing), though I'm not sure how do you define *quite reasonably* (memory wise?). — David Arenburg, Jan 10 '15 at 18:47
@David, Konrad is correct. I would like to keep everything as factors, if possible. Is it possible with `dplyr`? My actual data set is quite large, so I'm guessing that converting all of the factors to characters and then back to factors will be quite costly, both in terms of memory and computation. — tcquinn, Jan 10 '15 at 21:08
Why do you want to keep them as factors? Any perticulat reason? — David Arenburg, Jan 10 '15 at 21:24
@David: In my real example, I have hundreds of thousands of rows and about 20 different column pairs like the `current.grade` and `grade.history` column pair above that I want to work with. Each of these columns has a relatively small level set (fewer than 15 possible values), but each of those values is a long character string. I'm assuming that it's much more efficient in terms of memory and computation for R to be manipulating the integers from 1 to 15 (or whatever) than hundreds of thousands of long character strings with tons of redundant information in them. Do you think that's wrong? — tcquinn, Jan 10 '15 at 21:33

eipi10 · Answer 1 · 2015-01-10T21:55:04.360

Another way to approach this is to put your factor levels in their natural order, in this case Freshman, Sophomore, Junior, Senior, and then select the highest value for each ID using the which.max function for indexing. If you do it this way, you won't have to worry about whether your columns are ordered from lowest to highest grade for each ID (as you do if you use the last function).

library(dplyr)

df <-data.frame(ID=c(1,1,1,1,2,2,2,3,3),
                current.grade=as.factor(c("Senior","Senior","Senior","Senior",
                                          "Junior","Junior","Junior",
                                          "Sophomore","Sophomore")),
                grade.history=as.factor(c("Freshman","Sophomore","Junior","Senior",
                                          "Freshman","Sophomore","Junior",
                                          "Freshman","Sophomore")))


# Ordered vector of grades
gradeLookup = c("Freshman", "Sophomore", "Junior", "Senior")

# Reset the values in the grade columns to the ordering in gradeLookup
df[,-1] = lapply(df[,-1], function(x) {
  factor(x, levels=gradeLookup)
})

# For each ID, select the values of current.grade and grade.history at the maximum
# value of grade.history
df %>% group_by(ID) %>%
  summarise(current.grade.last = current.grade[which.max(grade.history)],
            grade.history.last = grade.history[which.max(grade.history)])

  ID current.grade.last grade.history.last
1  1             Senior             Senior
2  2             Junior             Junior
3  3          Sophomore          Sophomore

UPDATE 2: Since you want to sort and capture the last value (rather than the maximum value) by column, rather than whole rows, try this:

df %>% group_by(ID) %>%
  summarise(current.grade.last = current.grade[length(grade.history)],
            grade.history.last = grade.history[length(grade.history)])

END UPDATE 2

Does your data include a time variable, like year, term, or academic year? If so, you can dispense with current.grade and direclty select the value of grade.history at the most recent year of attendance. This will give you each student's last grade level. For example (assuming your time variable is called year):

df %>% group_by(ID) %>%
  summarise(last.grade = grade.history[which.max(year)])

UPDATE 1: I'm not sure what's causing your code to return the numerical code for each level, rather than the level label. It's not just an issue with the last function (you can see this if you do last(df$grade.history)). However, if you want to sort by time-stamp and then return the last row, the code below will keep the level labels. slice returns the rows you specify within each value of ID. In this case we specify the last row by using n(), which returns the total number of rows for each value of ID.

df.summary <- df %>%
  group_by(ID) %>%
  slice(n())

Yes, my real data set has a time stamp for each entry, and I use that time stamp to sort within each group using `arrange()` before I apply the `last()` function (i.e. I want the most recent entry, not necessarily the one with the highest value for grade or whatever). I left that step out of my toy example for simplicity. Your last suggestion is a good one, assuming that `last()` is the culprit (as opposed to `summarize()` or `group_by()`. Do you think that's the case? — tcquinn, Jan 10 '15 at 21:16
[Responding to update] Your solution using `slice()` does indeed work in my toy example (thank you!), but in my real example, I only want to apply this operation to certain columns (I want to apply other summary functions to other columns). That's why I was using `summarize()` and `last()`. Seems like it's one of those two functions that's converting the factors to integers. Any other suggestions? I supposed I could use `select()` to pull out the columns I want, then apply `slice()`, then later rejoin the result with the result of my other summary functions using `merge()`, but...ugh. — tcquinn, Jan 10 '15 at 21:48
[Responding to update 2] Yep. That works (and is similar to @lukeA's workaround above). The `last()` function seems to be the function that's converting the factors to integers. That's too bad, as it's my understanding that this is the sort of thing that the `last()` function was designed to do. — tcquinn, Jan 10 '15 at 22:09

score 0 · Answer 2 · answered Jan 10 '15 at 17:45

0

I guess it lies in the nature of a factor object in R, which is set of integer codes with a "levels" attribute of mode character. One way to overcome your problem: Wrap the factor variables into as.character:

  df.summary <- df %>%
  group_by(ID) %>%
  summarize(current.grade.last=last(as.character(current.grade)),
            grade.history.last=last(as.character(grade.history)))

answered Jan 10 '15 at 17:45

lukeA

53,097
5
97
100

Yes, I understand how factors work. I'd like to keep these variables as factors, if possible. My actual data set is quite large, so I'm guessing that converting all of the factors to characters and then back to factors would be fairly costly, both in terms of memory and computation. – tcquinn Jan 10 '15 at 21:11
What if you replace `last(...)` by `tail(..., 1)`? – lukeA Jan 10 '15 at 21:31
Yes. Using `tail(..., 1)` rather than `last(...)` seems to keep the variables as factors. That's a good workaround. Thank you. Seems too bad, though, since my understanding is that this is the sort of thing that the `last()` function was designed to do. – tcquinn Jan 10 '15 at 21:53

Handling factor variables in dplyr

2 Answers2