Adding counts of a factor to a dataframe

Question

I have a data frame where each row is an observation concerning a pupil. One of the vectors in the data frame is an id for the school. I have obtained a new vector with counts for each school as follows:

tbsch <- table(dt$school)

Now I want to add the relevant count value to each row in dt. I have done it by looping through each row of dt with for(), building a new vector containing the relevant count, and finally using cbind() to add it to dt, but I think this is very inefficient. Is there a smart/easy way to do that?

As per advice in meta, I am just adding the comment that I would like the order of observations to be preserved. — Joe King, Jul 01 '12 at 17:05

score 8 · Accepted Answer · answered Jul 01 '12 at 12:29

Using jmsigner's data you could do:

dt$count <- ave(dt$school, dt$school, FUN = length)

Tyler Rinker

Thanks, this looks really promising as it doesn't rely on a package. However, I can't get it to work for me. Could it be because `school` is alphanumeric? The `str` looks like `school : Factor w/ 247 levels "ABD","ABI","BHX",..: 142` – Joe King Jul 01 '12 at 16:59
It's hard to say without seeing your data. I see you're fairly new to Stack Overflow, so may I suggest that you make your problem reproducible? Without a data set to work with, helping you is more difficult. See this [LINK](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for direction on how to do this. – Tyler Rinker Jul 01 '12 at 17:49
This should work for you: `dt$count <- ave(as.numeric(dt$school), dt$school, FUN = length)` – Tyler Rinker Jul 01 '12 at 18:15
BTW, +1 , now I have some rep. – Joe King Jul 02 '12 at 10:26
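
For readers hitting the same problem as in the comments above: `ave()` writes each group's result back into a copy of its first argument, so passing a factor there tends to produce warnings and `NA`s rather than counts. A minimal sketch of two base R alternatives, assuming a small made-up data frame (the `pupil` column, the school codes, and the `count`/`count2` column names are illustrative, not the asker's data):

```r
# Hypothetical example data: school is a factor, as in the asker's str() output
dt <- data.frame(
  pupil  = 1:6,
  school = factor(c("ABD", "BHX", "ABD", "ABI", "BHX", "ABD"))
)

# Option 1: give ave() an integer first argument so the group lengths can be stored in it
dt$count <- ave(seq_along(dt$school), dt$school, FUN = length)

# Option 2: reuse the table from the question and look counts up by level name
tbsch <- table(dt$school)
dt$count2 <- as.vector(tbsch[as.character(dt$school)])

# Row order is untouched in both cases
dt
```

Both columns should agree, and neither approach needs a package or reorders the rows.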

BenBarnes · Answer 2 · 2015-06-17T17:55:31.097

3

This is a lot easier as of data.table v1.8.1, where `:=` works by group. Groups don't have to be contiguous, and the original row order is retained. It's just one line:

library(data.table)

# set up data
set.seed(2)
npupils <- rpois(10, 20)
pupil <- unlist(lapply(npupils, seq_len))
school <- rep(seq_along(npupils), npupils)
dt <- data.table(school = school, pupil = pupil) # Create a data.table
dt <- dt[sample(seq_len(nrow(dt)))] # Mix it up

dt
     school pupil
  1:      5     2
  2:      6    13
  3:      2    14
  4:      5     3
  5:     10    14
 ---             
186:      3    11
187:      7     2
188:      8    12
189:      3     6
190:      7    10

(dt[, schoolSize := .N, by = school])

     school pupil schoolSize
  1:      5     2         16
  2:      6    13         18
  3:      2    14         15
  4:      5     3         16
  5:     10    14         24
 ---                        
186:      3    11         14
187:      7     2         28
188:      8    12         19
189:      3     6         14
190:      7    10         28

That has all the usual speed advantages of fast grouping, and assigns the new column by reference with no copy at all.

Edit: Deleted an answer that was only relevant for data.table prior to version 1.8.1. (Thanks to Matthew for the update.)

edited Jun 17 '15 at 17:55

answered Jul 01 '12 at 12:20

BenBarnes

Hi @MatthewDowle, thanks for the update with 1.8.1 - I thought about putting it in the original answer, but since it wasn't on CRAN yet, I thought I'd wait. Thanks!! – BenBarnes Jul 02 '12 at 05:05
That's nice. data.table looks very interesting, I will check it out. I would also upvote but I don't have any reputation :( – Joe King Jul 02 '12 at 06:15
BTW, +1 , now I have some rep. – Joe King Jul 02 '12 at 10:28
@Joe and Ben : v1.8.2 is now on CRAN. – Matt Dowle Jul 18 '12 at 12:27
I guess the part before Matthew's edit is now obsolete. I think there'd be no harm in removing it. I just pointed to this question from a dupe http://stackoverflow.com/questions/30894636/anyone-knows-how-to-get-the-counts-of-every-element-of-a-column-in-it-self and it would be nice to have the right `data.table`-ish answer stand out more if this is to be the canonical ref. – Frank Jun 17 '15 at 15:19
@Frank, thanks. I'll edit once I get to a larger screen. – BenBarnes Jun 17 '15 at 15:25

johannes · Answer 3 · 2012-07-01T11:07:18.363

2

You could try something like this:

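dt <- data.frame(p = 1:20, school = sample(1:5, 20, replace = TRUE))
tbsch <- table(dt$school)

# Convert the table to a data frame with columns Var1 (school) and Freq (count)
tbsch <- data.frame(tbsch)

# Join the counts onto dt by school ID
merge(dt, tbsch, by.x = "school", by.y = "Var1")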

edited Jul 01 '12 at 11:07

answered Jul 01 '12 at 08:34

johannes

Thanks! Is there a way to keep the original order of the observations without having to sort afterwards? My problem is that I don't have consecutive IDs. I suppose I could add a column of sequential IDs and sort on it, but I'd prefer not to if possible. – Joe King Jul 01 '12 at 09:31
Have a look at `?merge` particularly for the argument `sort`. – johannes Jul 01 '12 at 09:33
Thanks again, but if `sort=F` then the result is still not in the original data order. – Joe King Jul 01 '12 at 09:59
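
One way to get the counts without `merge()` reordering anything is to look them up with `match()` and assign the result as a new column. This is only a sketch under assumptions: the toy data below and the `count` column name are made up for illustration.

```r
# Hypothetical data with non-consecutive, unsorted school IDs
dt <- data.frame(p = 1:8, school = c(3, 1, 3, 2, 1, 3, 5, 2))

# Counts per school as a data frame with columns Var1 (school) and Freq (count)
tbsch <- as.data.frame(table(dt$school))

# For each row of dt, find the matching row of tbsch and take its count
idx <- match(as.character(dt$school), as.character(tbsch$Var1))
dt$count <- tbsch$Freq[idx]

dt
```

Because the counts are assigned in place, the rows of `dt` stay in their original order and no helper ID column or sort is needed.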

score 1 · Answer 4 · answered Jul 01 '12 at 10:16

You can also use plyr and preserve the original order with this one-liner:

library(plyr)
join(dt, count(dt, "school"))

dickoa

Thanks, but what package is the function `count()` in? – Joe King Jul 01 '12 at 11:22
@JoeKing: plyr - http://cran.r-project.org/web/packages/plyr/index.html – sgibb Jul 01 '12 at 13:31
Thanks @sgibb , stupidly I had installed the package but not loaded it ! – Joe King Jul 01 '12 at 14:32
BTW, +1 , now I have some rep. FYI, I didn't accept this as the answer, because the other answer didn't require a package – Joe King Jul 02 '12 at 10:27
