
Suppose I have a data frame like this:

set.seed(123)
df <- data.frame(y = sample(c("A", "B", "C"), 10, TRUE),
                 X = sample(c(1, 2, 3), 10, TRUE))
   y X
1  A 3
2  C 2
3  B 3
4  C 2
5  C 1
6  A 3
7  B 1
8  C 1
9  B 1
10 B 3

What I want is to add a column z that gives, for each row, how many times that row's y value occurs in column y, such as:

   y X z
1  A 3 2
2  C 2 4
3  B 3 4
4  C 2 4
5  C 1 4
6  A 3 2
7  B 1 4
8  C 1 4
9  B 1 4
10 B 3 4

This means there are 2 As, 4 Cs, and 4 Bs.

David Z

3 Answers


We can use data.table to create the column 'z' based on the number of elements (.N) for each 'y'.

library(data.table)
DT <- as.data.table(df)
DT[, z := .N, by = y]  # .N is the number of rows in each group of y
DT
#    y X z
# 1: A 3 2
# 2: C 2 4
# 3: B 3 4
# 4: C 2 4
# 5: C 1 4
# 6: A 3 2
# 7: B 1 4
# 8: C 1 4
# 9: B 1 4
#10: B 3 4

Or using dplyr, we group by 'y' and create a new column 'z' with mutate. The dplyr equivalent to .N is n().

library(dplyr)
df %>%
   group_by(y) %>%
   mutate(z = n())
akrun

df$z <- table(df$y)[df$y]
df
#    y X z
# 1  A 3 2
# 2  C 2 4
# 3  B 3 4
# 4  C 2 4
# 5  C 1 4
# 6  A 3 2
# 7  B 1 4
# 8  C 1 4
# 9  B 1 4
# 10 B 3 4

table() returns the count of each distinct value in df$y together with the value's name, which saves a few steps. The approach relies on R's ability to subset by either position or name. Here the column is a factor, so df$y indexes the table by its integer codes, which follow the same order as the table's entries; the same expression also works if the column is character, because the lookup is then done by name.

Pierre L
  • I'm not sure regarding the names. Try `as.vector(table(df$y))[df$y]`, for example. `as.vector` strips the names. In this case it seems to work because `df$y` is an integer (factor) in this example. Though when `y` is of class `character` it seems to work according to names. Quite tricky. – David Arenburg Aug 26 '15 at 13:44
  • Yes @DavidArenburg , the factor is working underneath. But something cool is that even if the strings were characters the name subset would work but not the value. `table(df$y)[as.character(df$y)]` will still work due to name subsetting. But `as.vector(table(df$y))[as.character(df$y)]` won't. – Pierre L Aug 26 '15 at 13:49
  • Yes, I just wrote it in my comment above :). Worth investigation :) – David Arenburg Aug 26 '15 at 13:52
  • @DavidArenburg Just saw the edit. Flexible subsetting is the type of R feature that keeps people interested and puts R above other platforms. That along with matrix operations really show the power of the language :) – Pierre L Aug 26 '15 at 13:55
  • Though it may introduce some inconsistency... – David Arenburg Aug 26 '15 at 13:57
  • @DavidArenburg how so? – Pierre L Aug 26 '15 at 14:02
  • Because sometimes it operates over the names of the table and sometimes over its values. I'm not sure it's very consistent. – David Arenburg Aug 26 '15 at 14:03
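
The name-versus-position behaviour discussed in this answer and its comments can be checked with a small sketch. The vectors below are hypothetical toy data, not the question's df, and the variable names are made up for illustration:

```r
# Hypothetical toy data (not the question's df)
y_chr  <- c("A", "C", "B", "C", "C", "A")
y_fct  <- factor(y_chr)
counts <- table(y_chr)                 # named counts: A = 2, B = 1, C = 3

# Character index: entries are looked up by name
by_name <- as.vector(counts[y_chr])

# Factor index: entries are looked up by integer code (A = 1, B = 2, C = 3)
by_code <- as.vector(counts[y_fct])

# Once the names are stripped, only the factor (integer) index still works
bare          <- as.vector(counts)
bare_by_code  <- bare[y_fct]           # same counts as above
bare_by_name  <- bare[y_chr]           # all NA: no names left to match
```

Both by_name and by_code should hold the per-row counts, while bare_by_name should be all NA, which is the inconsistency the comments point at.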

Here's a simple approach using a for loop:

# unique() works whether y is a factor or a character vector
for (i in unique(df$y)) df$z[df$y == i] <- sum(df$y == i)
#> df
#   y X z
#1  A 3 2
#2  C 2 4
#3  B 3 4
#4  C 2 4
#5  C 1 4
#6  A 3 2
#7  B 1 4
#8  C 1 4
#9  B 1 4
#10 B 3 4
RHertel
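
For comparison, the per-group count can also be written in base R without an explicit loop. This is only a sketch using ave(), which none of the answers above use; the data frame toy is hypothetical and stands in for df:

```r
# Hypothetical toy data frame standing in for the question's df
toy <- data.frame(y = c("A", "C", "B", "C", "C", "A"),
                  X = c(3, 2, 3, 2, 1, 3))

# ave() splits the first argument by the grouping variable, applies FUN to
# each group, and returns the results in the original row order
toy$z <- ave(seq_along(toy$y), toy$y, FUN = length)
toy
```

Each row of toy$z should then hold the number of rows sharing that row's y value, the same quantity the loop computes.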