I have a vector that looks like this:

a <- c("A110","A110","A110","B220","B220","C330","D440","D440","D440","D440","D440","D440","E550")

I would like to create another vector, based on a, that should look like:

b <- c(1,1,1,2,2,3,4,4,4,4,4,4,5)

In other words, b should assign a value (starting from 1) to each different element of a.

David Arenburg
Gaspare
  • I changed it by mistake wanting to amend the original question. Open to suggestions to make it better. – Gaspare May 03 '16 at 19:32

1 Answer

First of all, (I assume) this is your vector

a <- c("A110","A110","A110","B220","B220","C330","D440","D440","D440","D440","D440","D440","E550")

As for possible solutions, here are a few (I can't find a good duplicate right now)

as.integer(factor(a))
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5

Or

cumsum(!duplicated(a))
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5

Or

match(a, unique(a))
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5

Also, rle will work similarly in your specific scenario

with(rle(a), rep(seq_along(values), lengths))
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5

Or (which is practically the same)

data.table::rleid(a)
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5

Be advised, though, that the 4 distinct solutions (rleid behaves like the rle one) each behave differently in other scenarios. Consider the following vector

a <- c("B110","B110","B110","A220","A220","C330","D440","D440","B110","B110","E550")

And the results of the 4 different solutions:

1.

as.integer(factor(a))
# [1] 2 2 2 1 1 3 4 4 2 2 5

The factor solution begins with 2 because factor sorts its levels by default, so in this unsorted vector "A220" becomes level 1 and "B110" becomes level 2, even though "B110" appears first. Hence, this solution is only valid if your vector is sorted, so don't use it otherwise.
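If the sorted levels are the only obstacle, one possible workaround (a sketch, not one of the solutions listed above) is to pass the levels in order of first appearance through the levels argument of factor. Note that it then behaves like the match/unique solution, so values that reappear later still get their original code.

```r
a <- c("B110","B110","B110","A220","A220","C330","D440","D440","B110","B110","E550")

# Use levels in order of first appearance instead of the default sorted order
ids <- as.integer(factor(a, levels = unique(a)))
ids
# [1] 1 1 1 2 2 3 4 4 1 1 5
```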

2.

cumsum(!duplicated(a))
# [1] 1 1 1 2 2 3 4 4 4 4 5

This cumsum/duplicated solution got confused because "B110" had already been present at the beginning, and hence it grouped "D440","D440","B110","B110" into the same group.
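To see where this goes wrong, it may help to look at the intermediate logical vector. The sketch below only splits the one-liner into two steps; the names first_seen and ids are illustrative.

```r
a <- c("B110","B110","B110","A220","A220","C330","D440","D440","B110","B110","E550")

# TRUE only at the first occurrence of each distinct value
first_seen <- !duplicated(a)
first_seen
# [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE

# The counter does not advance at positions 9 and 10 ("B110" again),
# so they inherit the id of the preceding "D440" run
ids <- cumsum(first_seen)
ids
# [1] 1 1 1 2 2 3 4 4 4 4 5
```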

3.

match(a, unique(a))
# [1] 1 1 1 2 2 3 4 4 1 1 5

This match/unique solution added ones at the end because it is sensitive to "B110" showing up in more than one sequence (because of unique), hence grouping them all into the same group regardless of where they appear.

4.

with(rle(a), rep(seq_along(values), lengths))
# [1] 1 1 1 2 2 3 4 4 5 5 6

This solution only cares about sequences, hence different sequences of "B110" were grouped into different groups.
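The same run-based ids can also be sketched in base R without rle, by flagging every position where the value differs from its predecessor and taking the cumulative sum. This assumes a non-empty vector with no NA values; the names run_start and run_id are illustrative.

```r
a <- c("B110","B110","B110","A220","A220","C330","D440","D440","B110","B110","E550")

# TRUE wherever an element differs from the one before it;
# the first element always starts a new run
run_start <- c(TRUE, a[-1] != a[-length(a)])

# Each TRUE bumps the counter, giving one id per consecutive run
run_id <- cumsum(run_start)
run_id
# [1] 1 1 1 2 2 3 4 4 5 5 6
```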

David Arenburg