20

I have a tbl_df where I want to group_by(u, v) for each distinct integer combination observed with (u, v).


EDIT: this was subsequently resolved by adding the (now-deprecated) group_indices() back in dplyr 0.4.0


a) I then want to assign each distinct group some arbitrary distinct number label=1,2,3... e.g. the combination (u,v)==(2,3) could get label 1, (1,3) could get 2, and so on. How to do this with one mutate(), without a three-step summarize-and-self-join?

dplyr has a neat function n(), but that gives the number of elements within its group, not the overall number of the group. In data.table this would simply be called .GRP.

b) Actually, what I really want is to assign a string/character label ('A','B',...). But numbering groups by integers is good enough, because I can then use integer_to_label(i) as below. Unless there's a clever way to merge these two? But don't sweat this part.

library(dplyr)
set.seed(1234)

# Helper fn for mapping integer 1..26 to character label
integer_to_label <- function(i) { substr("ABCDEFGHIJKLMNOPQRSTUVWXYZ",i,i) }

df <- tibble::as_tibble(data.frame(u=sample.int(3,10,replace=TRUE), v=sample.int(4,10,replace=TRUE)))

# Want to label/number each distinct group of unique (u,v) combinations
df %>% group_by(u,v) %>% mutate(label = n()) # WRONG: n() is the number of elements within its group, not the overall number of the group

   u v
1  2 3
2  1 3
3  1 2
4  2 3
5  1 2
6  3 3
7  1 3
8  1 2
9  3 1
10 3 4

KLUDGE 1: could do df %>% group_by(u,v) %>% summarize(label = n()) , then self-join
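
For reference, the three-step kludge can be sketched roughly as follows. This is only an illustration, not a recommendation: it uses distinct() plus row_number() rather than summarize(), the names group_labels and labelled are made up for the example, and the data frame is typed in by hand from the ten rows printed above.

```r
library(dplyr)

# The ten (u, v) rows printed in the question, typed in by hand
df <- tibble::tibble(
  u = c(2L, 1L, 1L, 2L, 1L, 3L, 1L, 1L, 3L, 3L),
  v = c(3L, 3L, 2L, 3L, 2L, 3L, 3L, 2L, 1L, 4L)
)

# Steps 1 and 2: one row per distinct (u, v) combination, then number those rows
group_labels <- df %>%
  distinct(u, v) %>%
  mutate(label = row_number())

# Step 3: join the labels back onto the original rows
labelled <- df %>% left_join(group_labels, by = c("u", "v"))
```

With this data, (2,3) gets label 1 and (1,3) gets label 2, because distinct() keeps combinations in order of first appearance.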
smci
  • @Randy-Lai and I both solved it, separately. Randy's is a cleaner idiom that lends itself to multiple `mutate/summarize(...)` actions. I found `interaction(u,v, drop=T)` – smci Apr 12 '14 at 23:30
  • What do you need this for? – hadley Apr 14 '14 at 23:11
  • @hadley: my particular reason is as stated in the question: I want to assign each distinct (u,v)-group some arbitrary (ordered) numbering=1,2,3... so I can ultimately assign them string labels 'A','B','C'... (my purpose is to subsequently refer to them by shorthand, in modeling and graphing) – smci Nov 18 '14 at 22:49
  • @hadley: but in general this is a useful feature, and data.table package implements `.GRP` for this. Any chance we can have something in dplyr please? :) – smci Nov 18 '14 at 22:51
  • 6
    next version will have `group_indices()` – hadley Nov 19 '14 at 15:59
  • @hadley Thanks! New in [0.4.0 (1/2015)](https://github.com/hadley/dplyr/releases) – smci Mar 16 '15 at 13:36
  • @SamFirke: thanks for the updates and answer, but please leave my ancient cave scribblings in the question. Also, don't delete the comparison to `data.table`, that's all useful too. – smci Feb 25 '21 at 21:59

6 Answers

51

For current dplyr versions (1.0.0 and higher)

Since version 1.0, dplyr has a new cur_group_id function for that:

df %>% 
    group_by(u, v) %>% 
    mutate(label = cur_group_id()) ...
    

For previous dplyr versions (before 1.0.0; the function is deprecated but still available as of 1.0.10)

dplyr has a group_indices() function that you can use like this:

df %>% 
    mutate(label = group_indices(., u, v)) %>% 
    group_by(label) ...
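
Part (b) of the question (character labels) can probably be folded into the same mutate() by indexing the built-in LETTERS vector with the group id. A minimal sketch, assuming dplyr 1.0.0 or later and at most 26 groups; the data frame is typed in by hand from the rows printed in the question, and labelled is a name made up for the example:

```r
library(dplyr)

# The ten (u, v) rows printed in the question, typed in by hand
df <- tibble::tibble(
  u = c(2L, 1L, 1L, 2L, 1L, 3L, 1L, 1L, 3L, 3L),
  v = c(3L, 3L, 2L, 3L, 2L, 3L, 3L, 2L, 1L, 4L)
)

labelled <- df %>%
  group_by(u, v) %>%
  # cur_group_id() is a single integer per group; LETTERS maps 1..26 to "A".."Z"
  mutate(label = LETTERS[cur_group_id()]) %>%
  ungroup()
```

Note that the ids follow the sorted order of the grouping keys, so (1,2) becomes "A" here even though (2,3) appears first in the table.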
Calimo
  • 5
    group_indices() uses the (alphabetical) ordering of the grouping variable though, is there any way of using it to preserve the ordering in the table, or applying your own? – maja zaloznik Sep 17 '19 at 12:52
  • 1
    Note that `group_indices()` was deprecated in dplyr 1.0.0. and has been replaced with `cur_group_id()`. – C. Rea Apr 19 '23 at 19:01
11

Another approach using data.table would be

require(data.table)
setDT(df)[,label:=.GRP, by = c("u", "v")]

which results in:

    u v label
 1: 2 1     1
 2: 1 3     2
 3: 2 1     1
 4: 3 4     3
 5: 3 1     4
 6: 1 1     5
 7: 3 2     6
 8: 2 3     7
 9: 3 2     6
10: 3 4     3
David Arenburg
Rentrop
9

Since dplyr version 1.0.0, the function cur_group_id() has replaced the older function group_indices().

Call it on the grouped data.frame:

df %>%
  group_by(u, v) %>%
  mutate(label = cur_group_id())

# A tibble: 10 x 3
# Groups:   u, v [6]
       u     v label
   <int> <int> <int>
 1     2     2     4
 2     2     2     4
 3     1     3     2
 4     3     2     6
 5     1     4     3
 6     1     2     1
 7     2     2     4
 8     2     4     5
 9     3     2     6
10     2     4     5
Sam Firke
6

Updated answer

get_group_number = function(){
    i = 0
    function(){
        i <<- i+1
        i
    }
}
group_number = get_group_number()
df %>% group_by(u,v) %>% mutate(label = group_number())

You can also consider the following slightly unreadable version:

group_number = (function(){i = 0; function() i <<- i+1 })()
df %>% group_by(u,v) %>% mutate(label = group_number())

Using the iterators package:

library(iterators)

counter = icount()
df %>% group_by(u,v) %>% mutate(label = nextElem(counter))
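
All three variants rely on the same idea: a closure that keeps its own counter between calls. The sketch below shows that state in isolation, in base R only; make_counter, next_id and fresh are hypothetical names for the illustration. Inside mutate() this only yields group numbers on the assumption that the expression is evaluated exactly once per group, and the counter has to be re-created before it is reused on another data frame.

```r
# Factory that returns a function with its own private counter
make_counter <- function() {
  i <- 0L
  function() {
    i <<- i + 1L  # update the counter in the enclosing environment
    i
  }
}

next_id <- make_counter()
ids <- c(next_id(), next_id(), next_id())  # each call advances the same counter

# A second counter starts again from scratch
fresh <- make_counter()
first_of_fresh <- fresh()
```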
David Arenburg
Randy Lai
  • 1
    No, this is wrong. I'm **not** looking for the row-number within a group. I'm looking for the **group-number** (the equivalent of `data.table .GRP`). Since we have 7 unique combinations of (u,v) in this example, the output labels should be 1:7 (in some arbitrary order) – smci Apr 12 '14 at 05:29
  • 1
    Sorry, I didn't pay much attention to your question. I have updated the answer with a dirty solution... – Randy Lai Apr 12 '14 at 05:35
  • not bad but that's essentially just a generator function that returns incrementing integers... surely we can obviate it? – smci Apr 12 '14 at 05:39
  • 1
    ^ Does R not do generator functions? (like Python `yield`?) Without having to manually save state inside your fn? – smci Apr 12 '14 at 07:25
  • 2
    you remind me of `iterators` package. I have never used it before. (And see the updated solution). But it is essentially equivalent to my original method. – Randy Lai Apr 12 '14 at 07:32
  • ^ Wow that's awesome! Best answer. Did you see my new one using `interaction(u,v)`? Can you figure out how to reorder the levels in increasing order? – smci Apr 12 '14 at 08:01
  • i think you will get the correct order if you order `df`. – Randy Lai Apr 12 '14 at 08:07
  • Assume we want to preserve the order of df (we do, my real case is more complicated). It would be clunky to `dplyr::arrange(u,v)` then do this group-numbering then revert to `dplyr::arrange()` – smci Apr 12 '14 at 08:09
  • may be `factor(interaction(sort(df$u),sort(df$v)))` (I didn't test it). – Randy Lai Apr 12 '14 at 08:14
  • Naw... I've been trying many things unsuccessfully for a while now. Might post as a separate question. – smci Apr 12 '14 at 08:40
  • Solved - see my updated answer. And question link below. – smci Apr 12 '14 at 09:27
  • Update: New [group_indices_ in 0.4.0 (1/2015)](https://github.com/hadley/dplyr/releases) – smci Mar 16 '15 at 13:37
2

Updating my answer with three different ways:

A) A neat non-dplyr solution using interaction(u,v):

> df$label <- factor(interaction(df$u,df$v, drop=T))
 [1] 1.3 2.3 2.2 2.4 3.2 2.4 1.2 1.2 2.1 2.1
 Levels: 2.1 1.2 2.2 3.2 1.3 2.3 2.4

> match(df$label, levels(df$label)[ rank(unique(df$label)) ] )
 [1] 1 2 3 4 5 4 6 6 7 7
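
A possibly simpler way to write approach A, as a sketch in base R only: the integer codes of the interaction factor already number the groups, and match() against the unique values renumbers them in order of first appearance. The data frame is typed in by hand from the rows printed in the question, and the column names by_level and by_appearance are made up for the example.

```r
# The ten (u, v) rows printed in the question, typed in by hand
df <- data.frame(
  u = c(2L, 1L, 1L, 2L, 1L, 3L, 1L, 1L, 3L, 3L),
  v = c(3L, 3L, 2L, 3L, 2L, 3L, 3L, 2L, 1L, 4L)
)

# One factor level per observed (u, v) combination; drop = TRUE removes unused ones
grp <- interaction(df$u, df$v, drop = TRUE)

# Group numbers in factor-level order
df$by_level <- as.integer(grp)

# Group numbers in order of first appearance in the table
df$by_appearance <- match(grp, unique(grp))
```

Both columns identify the same six groups; they differ only in which group receives which number.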

B) Making Randy's neat fast-and-dirty generator-function answer more compact:

get_next_integer = function(){
  i = 0
  function(u,v){ i <<- i+1 }
}
get_integer = get_next_integer() 

df %>% group_by(u,v) %>% mutate(label = get_integer())

C) Also here is a one-liner using a generator function abusing a global variable assignment from this:

i <- 0
generate_integer <- function() { return(assign('i', i+1, envir = .GlobalEnv)) }

df %>% group_by(u,v) %>% mutate(label = generate_integer())

rm(i)
Community
smci
  • The reason that I used `get_group_name` is to avoid using global variable. I think it is in general not a good idea to change global variables inside a function...but it works anyway. – Randy Lai Apr 12 '14 at 06:20
  • I compacted yours and put it at the top of my answer. An assignment evaluates to its LHS value, hence we can simply say `function(u,v){ i <<- i+1 }` – smci Apr 12 '14 at 06:45
  • I also found a neat three-liner non-dplyr way with `interaction(u,v)`, and added that at top. – smci Apr 12 '14 at 07:24
  • I also solved the incremental-order issue with `interaction(... drop=T)` per [this subquestion](http://stackoverflow.com/questions/23028406/how-to-reorder-arbitrary-integer-vector-to-be-in-increasing-order) – smci Apr 12 '14 at 09:23
2

I don't have enough reputation for a comment, so I'm posting an answer instead.

The solution using factor() is a good one, but it has the disadvantage that group numbers are assigned after factor() alphabetizes its levels. The same behaviour happens with dplyr's group_indices(). Perhaps you would like the group numbers to be assigned from 1 to n based on the current group order. In which case, you can use:

my_tibble %>% mutate(group_num = as.integer(factor(group_var, levels = unique(.$group_var))) )
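
The line above handles a single grouping column. For the two-column (u, v) case in the question, one option is to build a combined key first. A minimal sketch under that assumption; key, group_num and labelled are names chosen for the example, and the data frame is typed in by hand from the rows printed in the question:

```r
library(dplyr)

# The ten (u, v) rows printed in the question, typed in by hand
df <- tibble::tibble(
  u = c(2L, 1L, 1L, 2L, 1L, 3L, 1L, 1L, 3L, 3L),
  v = c(3L, 3L, 2L, 3L, 2L, 3L, 3L, 2L, 1L, 4L)
)

labelled <- df %>%
  mutate(
    # Combined key such as "2.3"; the separator keeps (1, 23) distinct from (12, 3)
    key = paste(u, v, sep = "."),
    # Levels in order of first appearance, so numbering follows the table order
    group_num = as.integer(factor(key, levels = unique(key)))
  )
```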
prince_of_pears
  • Thanks. As I noted in the question, this was all solved by adding `group_indices()` back in dplyr 0.4.0 in 2015 – smci Jun 29 '18 at 03:34