-2

Problem

How to create a index column in a dataframe for R given the categorical values in a column?

In other words, assume we have a dataframe as follows:

id cat 
1  A 
2  A 
3  A
4  B 
5  B 
6  C
7  C 
8  C 
9  C
10 C

How can we create a column called rank that does the following:

id cat rank 
1  A   1 
2  A   2
3  A   3 
4  B   1 
5  B   2 
6  C   1 
7  C   2
8  C   3
9  C   4 
10 C   5

Attempts

Assume the dataframe is called df. I tried the following:

  • aggregate(df, by = c('A','B','C'), length)
  • Starting writing a custom function to work with lapply, but ran into too many boundary cases.

This gave me mismatched length errors. Obviously the idea here was to get the count for each group and then write a function that can take the row value with lapply and keep counting until I hit the length.

Additional Notes

I am thinking about abandoning the above ideas and splitting the dataframe into separate smaller dataframes by cat value. I will then create a rank variable for each dataframe that is indexed. The challenge then becomes, what is a good way to combine all the dataframe values back into one dataframe with the new rank column?

None of this is sitting quite right with me though. Frankly, my gut says I'm doing this all wrong. Am I making this too hard? Is there a package or R trick that does this easily? I apologize if this seems silly, but I cannot in good conscious proceed further without seeking the advice of R programmers more skilled than me.

hlyates
  • 1,279
  • 3
  • 22
  • 44
  • 2
    this one? `ave(1:nrow(d), d$cat, FUN = seq_along)` – Roman Feb 27 '17 at 16:08
  • 2
    If you mean an actual grouped ranking of the `id` column, `df$rank <- ave(df$id, df$cat, FUN = rank)` – alistaire Feb 27 '17 at 16:15
  • @OliverFrost IMHO, while very similar I grant you, I am trying an index rank based on column value and the use case differs from mine. Thank you for the link. – hlyates Feb 27 '17 at 16:19
  • I gave upvotes for everyone who commented. I will mark the answer once I am able to test. Thanks for your help everyone. I am overwhelmed and humbled by the gracious responses. – hlyates Feb 27 '17 at 16:20

3 Answers3

6

dplyr has a dedicated function for that, row_number:

df %>%
    group_by(cat) %>%
    mutate(rank = row_number())
Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214
2

Using the dplyr package, this is quite straightforward.

Assuming your dataframe is called df, this will do the trick:

df %>%
  group_by(cat) %>%
  mutate(rank=1:n())

The idea is that you create a variable (with mutate) defined, for every group (defined by the value of cat, cf call to group_by), as the vector 1:n() where n() is the number of observations in the group.

The pipe %>% is a shortcut to avoid nested function calls. x %>% f means f(x) and x %>% f(y) means f(x, y). So x %>% f(y) %>% g(z) means g(f(x, y), z). Some like it and some hate it!

asachet
  • 6,620
  • 2
  • 30
  • 74
1

Here's base R solution (similar to the approach mentioned in your Additional Notes)

do.call(rbind, lapply(split(df, df$cat), function(x) cbind(x, rank = seq_along(x$id))))
d.b
  • 32,245
  • 6
  • 36
  • 77