This is a silly question, but I am new to R and it would make my life so much easier if I could figure out how to do this! Here is some sample data:

data <- read.table(text = "Category Y
 A 5.1
 A 3.14
 A 1.79
 A 3.21
 A 5.57
 B 3.68
 B 4.56
 B 3.32
 B 4.98
 B 5.82
 ",header = TRUE)

I want to add a column that counts the number of observations within a group. Here is what I want it to look like:

Category    Y    OBS
A          5.1    1
A          3.14   2
A          1.79   3
A          3.21   4
A          5.57   5
B          3.68   1
B          4.56   2
B          3.32   3
B          4.98   4
B          5.82   5

I have tried:

data <- data %>% group_by(Category) %>% mutate(count = c(1:length(Category)))

which just creates another column numbered from 1 to 10, and

data <- data %>% group_by(Category) %>% add_tally()

which just creates another column of all 5s.

4 Answers

Base R:

data$OBS <- ave(seq_len(nrow(data)), data$Category, FUN = seq_along)
data
#    Category    Y OBS
# 1         A 5.10   1
# 2         A 3.14   2
# 3         A 1.79   3
# 4         A 3.21   4
# 5         A 5.57   5
# 6         B 3.68   1
# 7         B 4.56   2
# 8         B 3.32   3
# 9         B 4.98   4
# 10        B 5.82   5

BTW: one can use any of the frame's columns as the first argument, including ave(data$Category, data$Category, FUN=seq_along), but ave chooses its output class based on the input class, so using a string as the first argument will result in a return of strings:

ave(data$Category, data$Category, FUN = seq_along)
#  [1] "1" "2" "3" "4" "5" "1" "2" "3" "4" "5"

While not heinous, it needs to be an intentional choice. Since it appears that you wanted an integer in that column, I chose the simplest integer-in, integer-out approach. It could also have used rep(1L,nrow(data)) or anything that is both integer and the same length as the number of rows in the frame, since seq_along (the function I chose) won't otherwise care.
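As a rough illustration of the class behaviour described above, the sketch below rebuilds the question's sample data with `data.frame()` (an assumption made only so the snippet is self-contained) and compares a character first argument with an integer one. The names `obs_chr` and `obs_int` are hypothetical.

```r
# Same sample data as in the question, rebuilt so the snippet runs on its own.
data <- data.frame(
  Category = rep(c("A", "B"), each = 5),
  Y = c(5.1, 3.14, 1.79, 3.21, 5.57, 3.68, 4.56, 3.32, 4.98, 5.82),
  stringsAsFactors = FALSE
)

# Character in, character out: the counts come back as strings.
obs_chr <- ave(data$Category, data$Category, FUN = seq_along)
class(obs_chr)

# Integer in, integer out: the values of the first argument do not matter,
# because seq_along() only looks at the length of each group.
obs_int <- ave(rep(1L, nrow(data)), data$Category, FUN = seq_along)
class(obs_int)
```

Either result can be assigned to `data$OBS`; the integer version avoids a later `as.integer()` conversion.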

r2evans
  • nicely explained, upvoted – AnilGoyal Mar 05 '21 at 15:46
  • Would this work in cases where the categories were non-sequential? – Daniel O Mar 06 '21 at 00:35
  • DanielO, yes, try it! There are techniques that require the `Category` variable to be clumped together without gaps, but I typically recommend against them, preferring something that is robust to that. This, [Sathish's](https://stackoverflow.com/a/66495433/3358272), and [AnilGoyal's](https://stackoverflow.com/a/66495401/3358272) answers are all robust to disorder in Category; unfortunately, `rle` is not: it finds runs (of same-ness) in `Category`, so broken groups of a category will be numbered separately. – r2evans Mar 06 '21 at 10:16
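To make the difference concrete, here is a small sketch on made-up data in which category "A" is interrupted by a "B" row. The data frame `df` and its column names are hypothetical and are not part of the question's data.

```r
# Hypothetical data: the "A" rows are not contiguous.
df <- data.frame(Category = c("A", "A", "B", "A", "B"), stringsAsFactors = FALSE)

# ave() groups by value, so the third "A" continues the count.
df$obs_ave <- ave(seq_len(nrow(df)), df$Category, FUN = seq_along)

# rle() groups by run, so the count restarts whenever the value changes.
df$obs_rle <- sequence(rle(df$Category)$lengths)

df
#   Category obs_ave obs_rle
# 1        A       1       1
# 2        A       2       2
# 3        B       1       1
# 4        A       3       1
# 5        B       2       1
```

Sorting by `Category` before calling `rle()` would make the two columns agree, at the cost of reordering the rows.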
library(dplyr) 
data %>% group_by(Category) %>% mutate(Obs = row_number()) 

# A tibble: 10 x 3
# Groups:   Category [2]
   Category     Y   Obs
   <chr>    <dbl> <int>
 1 A         5.1      1
 2 A         3.14     2
 3 A         1.79     3
 4 A         3.21     4
 5 A         5.57     5
 6 B         3.68     1
 7 B         4.56     2
 8 B         3.32     3
 9 B         4.98     4
10 B         5.82     5

OR

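# ave() returns character here because data$Category is character, so convert to integer
data$OBS <- as.integer(ave(data$Category, data$Category, FUN = seq_along))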

data
   Category    Y OBS
1         A 5.10   1
2         A 3.14   2
3         A 1.79   3
4         A 3.21   4
5         A 5.57   5
6         B 3.68   1
7         B 4.56   2
8         B 3.32   3
9         B 4.98   4
10        B 5.82   5
AnilGoyal
  • When I try that, I get this error: Error: `n()` must only be used inside dplyr verbs. – yaynikkiprograms Mar 05 '21 at 15:39
  • @yaynikkiprograms, that suggests that the `mutate` you're using is not `dplyr::mutate`, or that you didn't use this code verbatim. (You cannot use `row_number()` outside of `mutate` or other dplyr verbs.) – r2evans Mar 05 '21 at 15:40
  • ave(rep(1, nrow(data)), data$Category, FUN=cumsum) – G5W Mar 05 '21 at 15:42
  • Your first code block is likely the most appropriate for the OP since they first demonstrated a dplyr attempt. – r2evans Mar 05 '21 at 15:51
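If another attached package supplies its own `mutate()` (one possible cause of the error quoted in the comments, though the thread does not confirm it), namespace-qualifying the call is one way to rule that out. The sketch below assumes dplyr is installed and rebuilds the sample data so it runs on its own; `result` is a hypothetical name.

```r
library(dplyr)

# Same sample data as in the question, rebuilt so the snippet runs on its own.
data <- data.frame(
  Category = rep(c("A", "B"), each = 5),
  Y = c(5.1, 3.14, 1.79, 3.21, 5.57, 3.68, 4.56, 3.32, 4.98, 5.82)
)

# Reports which package's mutate() is currently found first on the search path.
environmentName(environment(mutate))

# Namespace-qualify mutate() so a same-named function from another attached
# package cannot take its place; ungroup() drops the grouping afterwards.
result <- data %>%
  group_by(Category) %>%
  dplyr::mutate(OBS = row_number()) %>%
  ungroup()

result$OBS
```

The trailing `ungroup()` is optional; it simply prevents later operations on `result` from being applied per group by accident.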
library(data.table)
setDT(data)[, OBS := seq_len(.N), by = .(Category)]
data
   Category    Y OBS
 1:        A 5.10   1
 2:        A 3.14   2
 3:        A 1.79   3
 4:        A 3.21   4
 5:        A 5.57   5
 6:        B 3.68   1
 7:        B 4.56   2
 8:        B 3.32   3
 9:        B 4.98   4
10:        B 5.82   5
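A possible alternative within data.table is `rowid()`, which numbers rows within each distinct value and needs no `by`. This is not part of the answer above, just a sketch assuming data.table is installed; the table `dt` is a hypothetical rebuild of the question's data.

```r
library(data.table)

# Same sample data as in the question, rebuilt as a data.table.
dt <- data.table(
  Category = rep(c("A", "B"), each = 5),
  Y = c(5.1, 3.14, 1.79, 3.21, 5.57, 3.68, 4.56, 3.32, 4.98, 5.82)
)

# rowid() counts occurrences of each Category value in row order,
# so it also works when the categories are not contiguous.
dt[, OBS := rowid(Category)]
dt
```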
Sathish
Another base R approach:

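# Note: rle() numbers runs of identical values, so this assumes the rows of
# each category are contiguous (sort by category first if they are not).
category <- c(rep('A', 5), rep('B', 5))
obs <- sequence(rle(as.character(category))$lengths)  # avoid shadowing sequence()
data <- data.frame(category = category, sequence = obs)
head(data, 10)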
reusen