I have a data frame with two columns and 70,000 rows. One column serves an identifier for a household, column b
in the example below. The other column refers to the individuals in the household, numbering them from 1 to n with some error (could be 1,2,3 or 1,4,5), column a
in the example below.
I'm trying to use hierarchical clustering with the number of individuals in a household as a feature. The code I've written below counts the number of individuals in a household and puts them in the proper column and row, however takes several minutes with the actual data set I have, I assume due to its size. Is there a better way of going about getting this information?
fake.data <- data.frame(a = c(1,1,5,6,7,1,2,3,1,2,4), b = c("a", "a", "a", "a", "a", "b", "b", "b", "c", "c", "c"))
fake.cluster <- data.frame(b = unique(fake.data$b))
fake.cluster$members <- sapply(fake.cluster$b, function(x) length(unique(subset(fake.data, fake.data$b == x)$a)))