I have a dataset with 87 variables and about 9 million observations. The earliest years do not collect information on number of children, so I have attempted to impute the number of children for these households. The code below tries to summarize the ratio between women of childbearing age and my imputed value, to compare with the Census estimates for those years. When I run the block of code below on my full data set,
library(tidyverse)
mid2 %>%
  filter(year < 1968) %>%
  group_by(hh_id) %>%
  summarise(hh_fem = .data$n_fem * (.data$pernum == 1),
            hh_kids = .data$n_kids * (.data$pernum == 1)) %>%
  summarise(tot_fem = sum(hh_fem),
            totkids = sum(hh_kids)) -> fk
I get this error:
Error in summarise_impl(.data, dots) :
Column `hh_fem` must be length 1 (a summary value), not 2
The initial restriction to years prior to 1968 limits the rows to the first 400-odd thousand. Looking at just the first five rows, I get no error and the answer I expect. By a process of trial and error, I determined that I could reproduce the error with just the first nine rows, and just the variables referenced in the function but not created there. These rows are reproduced below. The function works correctly on rows 1:8.
smidgen <- select(mid2[1:9, ], year, hh_id, n_fem, pernum, numprec, n_kids)
smidgen
# A tibble: 9 x 6
# Groups: hh_id [8]
year hh_id n_fem pernum numprec n_kids
<dbl> <chr> <int> <dbl> <dbl> <dbl>
1 1962 1962300001 1 1 1 0.9466731
2 1962 1962300002 0 1 1 0.0000000
3 1962 1962300003 0 1 1 0.0000000
4 1962 1962300004 0 1 1 0.0000000
5 1962 1962300005 0 1 1 0.0000000
6 1962 1962300006 0 1 1 0.0000000
7 1962 1962300007 0 1 1 0.0000000
8 1962 1962300008 2 1 2 1.8933462
9 1962 1962300008 2 2 2 1.8933462
Indeed, I get the same error from rows 8:9 alone, but not from row 8 or row 9 taken separately.
I do not see anything on row 9 to cause this problem. Indeed, I don’t see how any values in row 9 could change the length of hh_fem.
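One check that may help narrow this down is to count how many values the expression inside summarise() produces for each household. The sketch below is only a diagnostic under stated assumptions: rows 8:9 are retyped by hand from the printout above rather than taken from mid2, and the names rows89, check, n_rows and n_values are hypothetical ones introduced for illustration.

```r
library(dplyr)

# Rows 8:9 retyped by hand from the printout above (not read from mid2).
# `rows89` is a hypothetical name used only for this check.
rows89 <- tibble(
  year   = c(1962, 1962),
  hh_id  = c("1962300008", "1962300008"),
  n_fem  = c(2L, 2L),
  pernum = c(1, 2),
  n_kids = c(1.8933462, 1.8933462)
)

# For each household, count the rows in the group and the number of
# values the hh_fem expression yields. length() always returns a single
# number, so this summarise() gives one row per household.
check <- rows89 %>%
  group_by(hh_id) %>%
  summarise(n_rows   = n(),
            n_values = length(n_fem * (pernum == 1)))

check
```

If n_values comes back larger than 1 for household 1962300008, the expression is returning one value per person in the household rather than one per household, which would be consistent with the "must be length 1" message.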
Advice and thoughts greatly appreciated.