3

I have a data frame where each row is an observation concerning a pupil. One of the vectors in the data frame is an id for the school. I have obtained a new vector with counts for each school as follows:

tbsch <- table(dt$school)

Now I want to add the relevant count value to each row in dt. I have done it with a for() loop over the rows of dt, building a new vector containing the relevant count and then using cbind() to add it to dt, but I think this is very inefficient. Is there a smart/easy way to do that?

Arun
Joe King
  • As per advice in meta, I am just adding the comment that I would like the order of observations to be preserved. – Joe King Jul 01 '12 at 17:05

4 Answers

8

Using jmsigner's data, you could do:

dt$count <- ave(dt$school, dt$school, FUN = length)
Tyler Rinker
  • Thanks, this looks really promising as it doesn't rely on a package - however I can't get it to work for me. Could it be because `school` is alphanumeric ? The `str` looks like `school : Factor w/ 247 levels "ABD","ABI","BHX",..: 142` – Joe King Jul 01 '12 at 16:59
  • It's hard to say without seeing your data. I see you're fairly new to stack overflow so may I make the suggestion that you make your problem reproducible. Without a data set to work with it makes helping you more difficult. May I suggest you see this [LINK](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for direction on how to do this. – Tyler Rinker Jul 01 '12 at 17:49
  • This should work for you: `dt$count <- ave(as.numeric(dt$school), dt$school, FUN = length)` – Tyler Rinker Jul 01 '12 at 18:15
  • BTW, +1 , now I have some rep. – Joe King Jul 02 '12 at 10:26
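
The factor problem raised in the comments can be sketched on toy data. The data frame below is hypothetical (only the level names are borrowed from the `str` output quoted above), and passing `seq_along()` as the first argument is an alternative to the `as.numeric()` conversion suggested in the comment, not something the answer itself proposes. The idea is to hand `ave()` an integer vector rather than the factor, presumably avoiding the trouble of writing counts back into a factor:

```r
# Hypothetical toy data: school is a factor, as in the asker's case
dt <- data.frame(
  pupil  = 1:6,
  school = factor(c("ABD", "BHX", "ABD", "ABI", "BHX", "ABD"))
)

# ave() returns one value per row, in the original row order.
# The first argument only supplies a vector to be split by school,
# so any integer vector of the right length will do.
dt$count <- ave(seq_along(dt$school), dt$school, FUN = length)

print(dt)
```

Because `ave()` fills in the result position by position, the rows are never reordered.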
3

This is a lot easier in data.table v1.8.1. `:=` now works by group. Groups don't have to be contiguous and it retains the original order. And it's just one line:

library(data.table)

# set up data
set.seed(2)
npupils <- rpois(10, 20)
pupil <- unlist(lapply(npupils, seq_len))
school <- rep(seq_along(npupils), npupils)
dt <- data.table(school = school, pupil = pupil) # Create a data.table
dt <- dt[sample(seq_len(nrow(dt)))] # Mix it up

dt
     school pupil
  1:      5     2
  2:      6    13
  3:      2    14
  4:      5     3
  5:     10    14
 ---             
186:      3    11
187:      7     2
188:      8    12
189:      3     6
190:      7    10

(dt[, schoolSize := .N, by = school])

     school pupil schoolSize
  1:      5     2         16
  2:      6    13         18
  3:      2    14         15
  4:      5     3         16
  5:     10    14         24
 ---                        
186:      3    11         14
187:      7     2         28
188:      8    12         19
189:      3     6         14
190:      7    10         28

That has all the usual speed advantages of fast grouping, and assigns the new column by reference with no copy at all.


Edit: Deleted an answer that was only relevant for data.table prior to version 1.8.1. (Thanks to Matthew for the update.)

BenBarnes
  • Hi @MatthewDowle, thanks for the update with 1.8.1 - I thought about putting it in the original answer, but since it wasn't on CRAN yet, I thought I'd wait. Thanks!! – BenBarnes Jul 02 '12 at 05:05
  • That's nice. data.table looks very interesting, I will check it out. I would also upvote but I don't have any reputation :( – Joe King Jul 02 '12 at 06:15
  • BTW, +1 , now I have some rep. – Joe King Jul 02 '12 at 10:28
  • @Joe and Ben : v1.8.2 is now on CRAN. – Matt Dowle Jul 18 '12 at 12:27
  • I guess the part before Matthew's edit is now obsolete. I think there'd be no harm in removing it. I just pointed to this question from a dupe http://stackoverflow.com/questions/30894636/anyone-knows-how-to-get-the-counts-of-every-element-of-a-column-in-it-self and it would be nice to have the right `data.table`-ish answer stand out more if this is to be the canonical ref. – Frank Jun 17 '15 at 15:19
  • @Frank, thanks. I'll edit once I get to a larger screen. – BenBarnes Jun 17 '15 at 15:25
2

You could try something like this:

dt <- data.frame(p = 1:20, school = sample(1:5, 20, replace = TRUE))
tbsch <- table(dt$school)

tbsch <- data.frame(tbsch)

merge(dt, tbsch, by.x="school", by.y="Var1")
johannes
  • Thanks! Is there a way to keep the original order of the observations, without having to do a sort afterwards? My problem is that I don't have consecutive IDs. Of course, I suppose I can add a column of sequential IDs and sort on it, but I prefer not to if possible. – Joe King Jul 01 '12 at 09:31
  • Have a look at `?merge` particularly for the argument `sort`. – johannes Jul 01 '12 at 09:33
  • Thanks again, but if `sort=F` then the result is still not in the original data order. – Joe King Jul 01 '12 at 09:59
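
One way to keep the original row order, sketched here as an alternative to `merge()` rather than as part of this answer, is to index the table from the question by school name. The toy data are hypothetical; `tbsch` is built exactly as in the question:

```r
# Hypothetical toy data with non-consecutive, alphanumeric school IDs
dt <- data.frame(
  p      = 1:6,
  school = c("BHX", "ABD", "BHX", "ABI", "ABD", "BHX")
)

tbsch <- table(dt$school)

# Look up each row's school in the table by name. The result has one
# entry per row of dt, so nothing is sorted or reordered.
# as.character() matters if school is a factor; as.vector() drops the
# table's names and class so a plain integer column is stored.
dt$count <- as.vector(tbsch[as.character(dt$school)])

print(dt)
```

Since this is a plain lookup, no helper column of sequential IDs is needed to restore the order afterwards.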
1

You can also use plyr and preserve the original order with this one-liner:

library(plyr)
join(dt, count(dt, "school"))
dickoa