2

I am struggling to number subgroups in a dataframe.

Let's take iris dataset for example. Let's say that iris$Species identifies my subgroups (so I have three subgroups: setosa, versicolor, virginica). Now I would like to add another column to iris, let's say numer of observations: iris$Obs. And for each of the subgroups, I'd like to have numbers, from 1 to the length of a subgroup, that resets to 1 when the subgroup changes.

In other words, I'd like the "Obs" to start at 1 whenever "Species" changes and increment by one, as long as "Species" is the same.

I prepared an image, but being a complete noob I have no reputation points to paste it here...

Thank everyone for help!

EDIT: > dput(iris)

structure(list(Species = structure(c(1L, 1L, 1L, 2L, 2L, 3L), .Label = c("setosa", 
"versicolor", "virginica"), class = "factor")), .Names = "Species", row.names = c(NA, 
-6L), class = "data.frame")
Phil
  • 4,344
  • 2
  • 23
  • 33
rpl
  • 451
  • 4
  • 13

2 Answers2

2

You can try getanID from my "splitstackshape" package.

For what you describe, the code would be:

getanID(iris, "Species")

What it looks like where the group is changing:

getanID(iris, "Species")[45:55]
#     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species .id
#  1:          5.1         3.8          1.9         0.4     setosa  45
#  2:          4.8         3.0          1.4         0.3     setosa  46
#  3:          5.1         3.8          1.6         0.2     setosa  47
#  4:          4.6         3.2          1.4         0.2     setosa  48
#  5:          5.3         3.7          1.5         0.2     setosa  49
#  6:          5.0         3.3          1.4         0.2     setosa  50
#  7:          7.0         3.2          4.7         1.4 versicolor   1
#  8:          6.4         3.2          4.5         1.5 versicolor   2
#  9:          6.9         3.1          4.9         1.5 versicolor   3
# 10:          5.5         2.3          4.0         1.3 versicolor   4
# 11:          6.5         2.8          4.6         1.5 versicolor   5

Under the hood, it's essentially along the lines of:

library(data.table)
as.data.table(iris)[, ID := sequence(.N), by = Species]

Or if you prefer "dplyr":

iris %>%
  group_by(Species) %>%
  mutate(ID = sequence(n()))
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
2

1) ave Try ave:

transform(iris, Obs = ave(c(Species), Species, FUN = seq_along))

The first argument of ave can be anything as long as it is a numeric vector with one element per row. For example, we could have used 1:nrow(iris), numeric(nrow(iris)) or Sepal.Length. In this case Species is a "factor" and c(Species) is "numeric". The rows of each group need not be contiguous.

2) match Another possibility is this which subtracts the position of the first occurrence of each group from its sequence number and adds 1:

transform(iris, Obs = seq_along(Species) - match(Species, Species) + 1)

If the +1 is omitted the it will give numbers starting from 0 rather than 1. This solution requires that the rows of each group be contiguous.

G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341