Better subsetting and counting values in a dataframe

Question

I have a data frame with two columns and 70,000 rows. One column serves an identifier for a household, column b in the example below. The other column refers to the individuals in the household, numbering them from 1 to n with some error (could be 1,2,3 or 1,4,5), column a in the example below.

I'm trying to use hierarchical clustering with the number of individuals in a household as a feature. The code I've written below counts the number of individuals in a household and puts them in the proper column and row, however takes several minutes with the actual data set I have, I assume due to its size. Is there a better way of going about getting this information?

fake.data <- data.frame(a = c(1,1,5,6,7,1,2,3,1,2,4), b = c("a", "a", "a", "a", "a", "b", "b", "b", "c", "c", "c"))
fake.cluster <- data.frame(b = unique(fake.data$b))
fake.cluster$members <- sapply(fake.cluster$b, function(x)     length(unique(subset(fake.data, fake.data$b == x)$a)))

score 1 · Accepted Answer · answered Mar 30 '17 at 02:40

1

Don't know if this is quicker, but you could use dplyr in various ways. One approach: get the distinct rows and then count b.

library(dplyr)
fake.cluster <- fake.data %>%
  distinct() %>%
  count(b)

answered Mar 30 '17 at 02:40

neilfws

32,751
5
50
63

you are my hero – IJH Mar 30 '17 at 03:00

score 1 · Answer 2 · answered Mar 30 '17 at 02:58

1

Here is an option using data.table

library(data.table)
setDT(fake.data)[, .(members = uniqueN(a)), b]
#   b members
#1: a       4
#2: b       3
#3: c       3

answered Mar 30 '17 at 02:58

akrun

874,273
37
540
662

Better subsetting and counting values in a dataframe

2 Answers2