0

I have a data frame with two columns and 70,000 rows. One column serves an identifier for a household, column b in the example below. The other column refers to the individuals in the household, numbering them from 1 to n with some error (could be 1,2,3 or 1,4,5), column a in the example below.

I'm trying to use hierarchical clustering with the number of individuals in a household as a feature. The code I've written below counts the number of individuals in a household and puts them in the proper column and row, however takes several minutes with the actual data set I have, I assume due to its size. Is there a better way of going about getting this information?

fake.data <- data.frame(a = c(1,1,5,6,7,1,2,3,1,2,4), b = c("a", "a", "a", "a", "a", "b", "b", "b", "c", "c", "c"))
fake.cluster <- data.frame(b = unique(fake.data$b))
fake.cluster$members <- sapply(fake.cluster$b, function(x)     length(unique(subset(fake.data, fake.data$b == x)$a)))
IJH
  • 167
  • 1
  • 11

2 Answers2

1

Don't know if this is quicker, but you could use dplyr in various ways. One approach: get the distinct rows and then count b.

library(dplyr)
fake.cluster <- fake.data %>%
  distinct() %>%
  count(b)
neilfws
  • 32,751
  • 5
  • 50
  • 63
1

Here is an option using data.table

library(data.table)
setDT(fake.data)[, .(members = uniqueN(a)), b]
#   b members
#1: a       4
#2: b       3
#3: c       3
akrun
  • 874,273
  • 37
  • 540
  • 662