3

I am using a dplyr table in R. Typical fields would be a primary key, an id number identifying a group, a date field, and some values. I did some manipulation in preliminary steps that throws out a bunch of the data.

In order to do the next step of my analysis (in MC Stan), it'll be easier if both the date and the group id fields are integer indices. So basically, I need to re-index them as integers between 1 and the total number of distinct elements (about 750 for group_id and about 250 for date_id; group_id is already an integer, but the date is not). This is relatively straightforward to do after exporting to a data frame, but I was curious whether it is possible in dplyr.

My attempt at creating a new date index (called date_val_new) is below. Per the discussion in the comments, I have added some fake data. I purposefully made the group and date values not start at 1, though I didn't make the date an actual date. I also made the data unbalanced, removing some rows, to illustrate the issue. The dplyr command restarts the index at 1 for each group, regardless of the date_val, so every group starts at 1 even when the dates differ.

df1 <- data.frame(id = 1:40,
                  group_id = 10 + rep(1:10, each = 4),
                  date_val = 20 + rep(1:4, 10),
                  val = runif(40))
df1 <- df1[!df1$id %in% c(5, 17, 33), ]

df_new <- df1 %>%
            group_by(group_id) %>%
            arrange(date_val) %>%
            mutate(date_val_new=row_number(group_id)) %>%
            ungroup()
John
  • Maybe a simple example (with fake data if it's easier for you) to see what `df` looks like and what you want your `df_new` to be? It seems, from what I understood, that you want to get all the unique dates you have, order them and assign each an integer, then join that integer back based on the date value. Right? – AntoniosK Nov 03 '15 at 22:03
  • Are you looking for this thing? `?group_indices` – Frank Nov 03 '15 at 22:17
  • @AntoniosK Okay. I'll make one. – John Nov 03 '15 at 22:17
  • @AntoniosK I added the sample data and re-wrote the last paragraph. – John Nov 03 '15 at 22:36
  • @Frank Thanks. I wasn't familiar with that. I think I can use group_indices to extract a unique index field, which is likely sufficient for my actual application. The downside is that I can't seem to do something like df1 %>% group_by(date_val) %>% mutate(date_val_new = group_indices()). Is this not possible? – John Nov 03 '15 at 22:41
  • @John Yeah, I can't find a way to actually use `group_indices` to construct a column... wonder what use it has if not for doing that. Maybe a dplyr guru will come along and enlighten. – Frank Nov 03 '15 at 22:45
  • 1
    Perhaps the best you can do is the base R method: `df1 %>% mutate(date_val_new = match(date_val, unique(date_val)))`. – Frank Nov 03 '15 at 22:53
  • Nice and much simpler @Frank . I think that `df1 %>% mutate(date_val_new = as.integer(as.factor(date_val)))` will also work in this case. Even with real datetime values. – AntoniosK Nov 03 '15 at 22:57
  • @Frank Thanks for the update. It looks like this works for PostgreSQL tables. – John Nov 05 '15 at 17:52
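Expanding the `match()` and `as.factor()` suggestions from the comments into a runnable sketch on the fake data above (both give integer indices 1 through the number of distinct values; `as.factor()` sorts its levels, so the indices follow date order even when dates first appear out of order):

```r
library(dplyr)

# the question's fake data, with the same rows removed
df1 <- data.frame(id = 1:40,
                  group_id = 10 + rep(1:10, each = 4),
                  date_val = 20 + rep(1:4, 10),
                  val = runif(40))
df1 <- df1[!df1$id %in% c(5, 17, 33), ]

df_new <- df1 %>%
  mutate(date_val_new = as.integer(as.factor(date_val)),
         group_id_new = as.integer(as.factor(group_id)))

range(df_new$date_val_new)  # 1 4: one index per distinct date
range(df_new$group_id_new)  # 1 10: one index per distinct group
```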

3 Answers

7

This is the base R method:

df1 %>% mutate(date_val_new = match(date_val, unique(date_val)))

Or with a data.table, df1[, date_val_new := .GRP, by=date_val].
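As noted in the comments, `match()` numbers the dates in order of first appearance; wrapping the lookup vector in `sort()` makes the indices follow date order instead. A sketch using the question's fake data:

```r
library(dplyr)

# the question's fake data, with the same rows removed
df1 <- data.frame(id = 1:40,
                  group_id = 10 + rep(1:10, each = 4),
                  date_val = 20 + rep(1:4, 10),
                  val = runif(40))
df1 <- df1[!df1$id %in% c(5, 17, 33), ]

df_new <- df1 %>%
  mutate(date_val_new = match(date_val, sort(unique(date_val))),
         group_id_new = match(group_id, sort(unique(group_id))))

head(df_new$date_val_new, 6)  # 1 2 3 4 2 3 (row with date 21 in group 12 was removed)
```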

Frank
  • After some testing, neither this nor the answer by Steven Beaupre works for the tbl_postgres of my actual data. Works fine for data frames, though. For that matter, I don't think group_indices works for tbl_postgres. In the above example, if I convert it to a tbl, it still works, so I'm not entirely sure what the issue is. – John Nov 04 '15 at 20:37
  • @John I think that, while dplyr works with databases, it has somewhat limited functionality with them, on account of SQL syntax and DBMS-specific features (or lack of features). I'm not sure whether that's true or where it might be documented, though. Maybe have a look at `vignette("databases")` Steven may know more. – Frank Nov 04 '15 at 20:44
  • I looked a little at the code behind group_indices (grouped_indices_impl) and I only see things related to data frames. – John Nov 04 '15 at 20:48
  • @John Yeah, I guess you're right. Not sure if you know, but you can run `methods("group_indices_")` to see that as well. – Frank Nov 04 '15 at 20:50
5

Use group_indices_() to generate a unique id for each group:

df1 %>% mutate(date_val_new = group_indices_(., .dots = "date_val"))

Update

Since group_indices() does not handle class tbl_postgres, you could try dense_rank():

copy_to(my_db, df1, name = "df1")
tbl(my_db, "df1") %>% 
  mutate(date_val_new = dense_rank(date_val))

Or build a custom query using sql():

tbl(my_db, sql("SELECT *, 
               DENSE_RANK() OVER (ORDER BY date_val) AS DATE_VAL_NEW
               FROM df1"))
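dense_rank() also works on an ordinary data frame, so the same indexing can be checked locally before pushing it to the database. A sketch on the question's fake data:

```r
library(dplyr)

# the question's fake data, with the same rows removed
df1 <- data.frame(id = 1:40,
                  group_id = 10 + rep(1:10, each = 4),
                  date_val = 20 + rep(1:4, 10),
                  val = runif(40))
df1 <- df1[!df1$id %in% c(5, 17, 33), ]

# dense_rank() maps each distinct value to consecutive integers,
# ordered by the value itself, with no gaps for removed rows
df_new <- df1 %>% mutate(date_val_new = dense_rank(date_val))

max(df_new$date_val_new)  # 4: one index per distinct date
```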
Steven Beaupré
  • Your answer works, but I accepted @Frank's answer instead. I needed something to hang my hat on. I ran microbenchmark and his was faster. – John Nov 04 '15 at 16:31
  • Frank had pointed me to your updated answer yesterday. The dense_rank does work for tbl_postgres. I had already accepted his, so I wasn't sure if I should change it. – John Nov 06 '15 at 18:29
1

Alternatively, I think you can try getanID() from the splitstackshape package.

library(splitstackshape)
getanID(df1, "group_id")[]

#    id group_id date_val        val .id
# 1:  1       11       21 0.01857242   1
# 2:  2       11       22 0.57124557   2
# 3:  3       11       23 0.54318903   3
# 4:  4       11       24 0.59555088   4
# 5:  6       12       22 0.63045007   1
# 6:  7       12       23 0.74571297   2
# 7:  8       12       24 0.88215668   3
jazzurro