3

I am using a dplyr table in R. Typical fields would be a primary key, an id number identifying a group, a date field, and some values. I did some manipulation in preliminary steps that throws out a bunch of the data.

In order to do the next step of my analysis (in MC Stan), it'll be easier if both the date and the group id fields are integer indices. So basically, I need to re-index them as integers between 1 and the total number of distinct elements (about 750 for group_id and about 250 for date_id; group_id is already an integer, but the date is not). This is relatively straightforward to do after exporting to a data frame, but I was curious whether it is possible in dplyr.

My attempt at creating a new date index (called date_val_new) is below. Per the discussion in the comments, I have added some fake data. I purposefully made the group and date values not start at 1, though I didn't make the date an actual date. I also made the data unbalanced, removing some rows, to illustrate the issue. The dplyr command restarts the index at 1 for each group, regardless of the date_val, so every group starts at 1 even when the dates differ.

df1 <- data.frame(id = 1:40,
                  group_id = 10 + rep(1:10, each = 4),
                  date_val = 20 + rep(1:4, 10),
                  val = runif(40))
df1 <- df1[!df1$id %in% c(5, 17, 33), ]

df_new <- df1 %>%
            group_by(group_id) %>%
            arrange(date_val) %>%
            mutate(date_val_new=row_number(group_id)) %>%
            ungroup()
John
  • Maybe a simple example (with fake data if it's easier for you) to see what `df` looks like and what you want your `df_new` to be? It seems, from what I understood, that you want to get all the unique dates you have, order them and assign each an integer, then join that integer back based on the date value. Right? – AntoniosK Nov 03 '15 at 22:03
  • Are you looking for this thing? `?group_indices` – Frank Nov 03 '15 at 22:17
  • @AntoniosK Okay. I'll make one. – John Nov 03 '15 at 22:17
  • @AntoniosK I added the sample data and re-wrote the last paragraph. – John Nov 03 '15 at 22:36
  • @Frank Thanks. I wasn't familiar with that. I think I can use group_indices to extract a unique index field, which is likely sufficient for my actual application. The downside is that I can't seem to do something like df1 %>% group_by(date_val) %>% mutate(date_val_new = group_indices()). Is this not possible? – John Nov 03 '15 at 22:41
  • @John Yeah, I can't find a way to actually use `group_indices` to construct a column... wonder what use it has if not for doing that. Maybe a dplyr guru will come along and enlighten. – Frank Nov 03 '15 at 22:45
  • 1
    Perhaps the best you can do is the base R method: `df1 %>% mutate(date_val_new = match(date_val, unique(date_val)))`. – Frank Nov 03 '15 at 22:53
  • Nice and much simpler @Frank . I think that `df1 %>% mutate(date_val_new = as.integer(as.factor(date_val)))` will also work in this case. Even with real datetime values. – AntoniosK Nov 03 '15 at 22:57
  • @Frank Thanks for the update. It looks like this works for PostgreSQL tables. – John Nov 05 '15 at 17:52
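Expanding the `match()` and `as.factor()` suggestions from the comments into a runnable sketch on the fake data above (both give integer indices 1 through the number of distinct values; `as.factor()` sorts its levels, so the indices follow date order even when dates first appear out of order):

```r
library(dplyr)

# the question's fake data, with the same rows removed
df1 <- data.frame(id = 1:40,
                  group_id = 10 + rep(1:10, each = 4),
                  date_val = 20 + rep(1:4, 10),
                  val = runif(40))
df1 <- df1[!df1$id %in% c(5, 17, 33), ]

df_new <- df1 %>%
  mutate(date_val_new = as.integer(as.factor(date_val)),
         group_id_new = as.integer(as.factor(group_id)))

range(df_new$date_val_new)  # 1 4: one index per distinct date
range(df_new$group_id_new)  # 1 10: one index per distinct group
```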

3 Answers

7

This is the base R method:

df1 %>% mutate(date_val_new = match(date_val, unique(date_val)))

Or with a data.table, df1[, date_val_new := .GRP, by=date_val].
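As noted in the comments, `match()` numbers the dates in order of first appearance; wrapping the lookup vector in `sort()` makes the indices follow date order instead. A sketch using the question's fake data:

```r
library(dplyr)

# the question's fake data, with the same rows removed
df1 <- data.frame(id = 1:40,
                  group_id = 10 + rep(1:10, each = 4),
                  date_val = 20 + rep(1:4, 10),
                  val = runif(40))
df1 <- df1[!df1$id %in% c(5, 17, 33), ]

df_new <- df1 %>%
  mutate(date_val_new = match(date_val, sort(unique(date_val))),
         group_id_new = match(group_id, sort(unique(group_id))))

head(df_new$date_val_new, 6)  # 1 2 3 4 2 3 (row with date 21 in group 12 was removed)
```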

Frank
  • After some testing, neither this nor the answer by Steven Beaupre works for the tbl_postgres of my actual data. Works fine for data frames, though. For that matter, I don't think group_indices works for tbl_postgres. In the above example, if I convert it to a tbl, it still works, so I'm not entirely sure what the issue is. – John Nov 04 '15 at 20:37
  • @John I think that, while dplyr works with databases, it has somewhat limited functionality with them, on account of SQL syntax and DBMS-specific features (or lack of features). I'm not sure whether that's true or where it might be documented, though. Maybe have a look at `vignette("databases")` Steven may know more. – Frank Nov 04 '15 at 20:44
  • I looked a little at the code behind group_indices (grouped_indices_impl) and I only see things related to data frames. – John Nov 04 '15 at 20:48
  • @John Yeah, I guess you're right. Not sure if you know, but you can run `methods("group_indices_")` to see that as well. – Frank Nov 04 '15 at 20:50
5

Use group_indices_() to generate a unique id for each group:

df1 %>% mutate(date_val_new = group_indices_(., .dots = "date_val"))

Update

Since group_indices() does not handle class tbl_postgres, you could try dense_rank():

copy_to(my_db, df1, name = "df1")
tbl(my_db, "df1") %>% 
  mutate(date_val_new = dense_rank(date_val))

Or build a custom query using sql():

tbl(my_db, sql("SELECT *, 
               DENSE_RANK() OVER (ORDER BY date_val) AS DATE_VAL_NEW
               FROM df1"))
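dense_rank() also works on an ordinary data frame, so the same indexing can be checked locally before pushing it to the database. A sketch on the question's fake data:

```r
library(dplyr)

# the question's fake data, with the same rows removed
df1 <- data.frame(id = 1:40,
                  group_id = 10 + rep(1:10, each = 4),
                  date_val = 20 + rep(1:4, 10),
                  val = runif(40))
df1 <- df1[!df1$id %in% c(5, 17, 33), ]

# dense_rank() maps each distinct value to consecutive integers,
# ordered by the value itself, with no gaps for removed rows
df_new <- df1 %>% mutate(date_val_new = dense_rank(date_val))

max(df_new$date_val_new)  # 4: one index per distinct date
```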
Steven Beaupré
  • Your answer works, but I accepted @Frank's answer instead. I needed something to hang my hat on. I ran microbenchmark and his was faster. – John Nov 04 '15 at 16:31
  • Frank had pointed me to your updated answer yesterday. The dense_rank does work for tbl_postgres. I had already accepted his, so I wasn't sure if I should change it. – John Nov 06 '15 at 18:29
1

Alternatively, I think you can try getanID() from the splitstackshape package.

library(splitstackshape)
getanID(df1, "group_id")[]

#    id group_id date_val        val .id
# 1:  1       11       21 0.01857242   1
# 2:  2       11       22 0.57124557   2
# 3:  3       11       23 0.54318903   3
# 4:  4       11       24 0.59555088   4
# 5:  6       12       22 0.63045007   1
# 6:  7       12       23 0.74571297   2
# 7:  8       12       24 0.88215668   3
jazzurro