How to use R 3.3.2 to add index column to dataframe based on column value?

Question

Problem

How to create a index column in a dataframe for R given the categorical values in a column?

In other words, assume we have a dataframe as follows:

id cat 
1  A 
2  A 
3  A
4  B 
5  B 
6  C
7  C 
8  C 
9  C
10 C

How can we create a column called rank that does the following:

id cat rank 
1  A   1 
2  A   2
3  A   3 
4  B   1 
5  B   2 
6  C   1 
7  C   2
8  C   3
9  C   4 
10 C   5

Attempts

Assume the dataframe is called df. I tried the following:

aggregate(df, by = c('A','B','C'), length)
Starting writing a custom function to work with lapply, but ran into too many boundary cases.

This gave me mismatched length errors. Obviously the idea here was to get the count for each group and then write a function that can take the row value with lapply and keep counting until I hit the length.

Additional Notes

I am thinking about abandoning the above ideas and splitting the dataframe into separate smaller dataframes by cat value. I will then create a rank variable for each dataframe that is indexed. The challenge then becomes, what is a good way to combine all the dataframe values back into one dataframe with the new rank column?

None of this is sitting quite right with me though. Frankly, my gut says I'm doing this all wrong. Am I making this too hard? Is there a package or R trick that does this easily? I apologize if this seems silly, but I cannot in good conscious proceed further without seeking the advice of R programmers more skilled than me.

If you mean an actual grouped ranking of the `id` column, `df$rank <- ave(df$id, df$cat, FUN = rank)` — alistaire, Feb 27 '17 at 16:15
@OliverFrost IMHO, while very similar I grant you, I am trying an index rank based on column value and the use case differs from mine. Thank you for the link. — hlyates, Feb 27 '17 at 16:19
I gave upvotes for everyone who commented. I will mark the answer once I am able to test. Thanks for your help everyone. I am overwhelmed and humbled by the gracious responses. — hlyates, Feb 27 '17 at 16:20

score 6 · Accepted Answer · answered Feb 27 '17 at 16:12

6

dplyr has a dedicated function for that, row_number:

df %>%
    group_by(cat) %>%
    mutate(rank = row_number())

answered Feb 27 '17 at 16:12

Konrad Rudolph

530,221
131
937
1,214

asachet · Answer 2 · 2017-02-27T16:10:12.240

2

Using the dplyr package, this is quite straightforward.

Assuming your dataframe is called df, this will do the trick:

df %>%
  group_by(cat) %>%
  mutate(rank=1:n())

The idea is that you create a variable (with mutate) defined, for every group (defined by the value of cat, cf call to group_by), as the vector 1:n() where n() is the number of observations in the group.

The pipe %>% is a shortcut to avoid nested function calls. x %>% f means f(x) and x %>% f(y) means f(x, y). So x %>% f(y) %>% g(z) means g(f(x, y), z). Some like it and some hate it!

edited Feb 27 '17 at 16:10

answered Feb 27 '17 at 16:05

asachet

6,620
2
30
74

Probably safer to use `rank`, i.e. `mutate(rank = rank(id))` or `dplyr::dense_rank`. – alistaire Feb 27 '17 at 16:13

d.b · Answer 3 · 2017-02-27T16:17:22.770

1

Here's base R solution (similar to the approach mentioned in your Additional Notes)

do.call(rbind, lapply(split(df, df$cat), function(x) cbind(x, rank = seq_along(x$id))))

edited Feb 27 '17 at 16:17

answered Feb 27 '17 at 16:16

d.b

32,245
6
36
77

`ave` is way simpler. – alistaire Feb 27 '17 at 16:17

How to use R 3.3.2 to add index column to dataframe based on column value?

Problem

Attempts

Additional Notes

3 Answers3