1

I have a dataset with 87 variables and about 9 million observations. The earliest years do not collect information on number of children. I have attempted to impute a value for number of children to these households. The function below tries to summarize the ratio between women of childbearing age and my imputed value, to compare with the Census estimates for those years. When I run the block of code below on my full data set,

library(tidyverse)
mid2  %>% filter(year < 1968)   %>% 
  group_by(hh_id)    %>% 
  summarise(hh_fem  = .data$n_fem  * (.data$pernum == 1), 
           hh_kids  = .data$n_kids * (.data$pernum == 1))  %>%   
  summarise(tot_fem = sum(hh_fem), 
            totkids = sum(hh_kids))    ->  fk

get this error:

Error in summarise_impl(.data, dots) : 
  Column `hh_fem` must be length 1 (a summary value), not 2

The initial restriction to years prior to 1968 limits the rows to the first 400-odd thousand. Looking at just the first five rows, I get no error and the answer I expect. By a process of trial and error, I determined that I could reproduce the error with just the first nine rows, and just the variables referenced in the function but not created there. These rows are reproduced below. The function works correctly on rows 1:8.

smidgen <- select(mid2[9, ], year, hh_id, n_fem, pernum, numprec, n_kids)

smidgen
# A tibble: 9 x 6
# Groups:   hh_id [8]
   year      hh_id n_fem pernum numprec    n_kids
  <dbl>      <chr> <int>  <dbl>   <dbl>     <dbl>
1  1962 1962300001     1      1       1 0.9466731
2  1962 1962300002     0      1       1 0.0000000
3  1962 1962300003     0      1       1 0.0000000
4  1962 1962300004     0      1       1 0.0000000
5  1962 1962300005     0      1       1 0.0000000
6  1962 1962300006     0      1       1 0.0000000
7  1962 1962300007     0      1       1 0.0000000
8  1962 1962300008     2      1       2 1.8933462
9  1962 1962300008     2      2       2 1.8933462

Indeed, I generate the same error from rows 8:9 alone. Not, however, from either row 8 or row 9, taken separately.

I do not see anything on row 9 to cause this problem. Indeed, I don’t see how any values in row nine could change the width of hh_fem.

Advice and thoughts greatly appreciated.

andrewH
  • 2,281
  • 2
  • 22
  • 32

1 Answers1

0

The error is in the first call to summarise:

 summarise(hh_fem  = .data$n_fem  * (.data$pernum == 1)

If you want to filter the observation in the variable pernum to just the ones with a value of 1, you have to call first the filter() function and then the summarise() function

Your code should work with this change:

 mid2  %>% filter(year < 1968)   %>% 
 filter(pernum == 1) %>%
 group_by(hh_id)    %>% 
 summarise(hh_fem  = n_fem  * pernum, 
           hh_kids  = n_kids * pernum)  %>%   
 summarise(tot_fem = sum(hh_fem), 
        totkids = sum(hh_kids))    ->  fk

For the future please post a reprex in order to easily reproduce your problem

Antonio
  • 327
  • 1
  • 11
  • Thanks Antonio! That's fantastic. Could you possibly give me some hint of why my code didn't work, and why it produced the error that it did? I understand why your code works, but not why mine does not, and I'd like to avoid similar errors in the future. I put a fair amount of effort into producing what I thought was a reproducible example, though i see i forgot the library call too tidyverse. Is that what you are referring to? Or is it now preferred practice to priduce such examples specic – andrewH Dec 08 '17 at 19:24
  • I guess that the reason why your code brekas is that you try to subset the variable `pernum` within the call to `summarise()`, which is not allowed. As a consequence, `summarise()` gives an error because it is designed to give in output a single observation (a summary value) per variable per each level of the grouping variable. For what concerns the `reprex`, just make a call to `dput()` next time, so it's easy to recreate your dataset ;) – Antonio Dec 08 '17 at 19:31
  • ooops! hit return. continuing: specifically byusing the regex package? – andrewH Dec 08 '17 at 19:32
  • Thanks, Antonio! Re dput: will do.. The reason I am still confused is that, although you are correct that my purpose was subsetting, I thought my code simply did element-by-element multiplication, leaving a zero in all the rows I did not want, which could be summed away in the next summary. I don't understand how element-by-element multiplication can give a scalar on line 8 and a length-2 vector on line 9. At least, that is what the error message seems to suggest. – andrewH Dec 08 '17 at 19:46
  • The thing is that the purpose of `summarise()` is to give you with just a single value. If you do an element-wise operation the expected result will be a vector of length equal to the input vectors. In your code you are trying to do a vector multiplication whose result will be a vector of length 2, generating the conflict. For a better explanation of what `summarise()` does take a look to [this chapter](http://r4ds.had.co.nz/transform.html) or other `tidyverse` resources. – Antonio Dec 08 '17 at 19:55
  • OK, I think I got it. Because n_fem is duplicated in each row within a household, to get the right answer, I have to zero out the duplicate values. Then I thought the second summarise would sum out the zeros. But although the formula for hh_fem always produces a scalar, that formula applied element by element within a group produces a vector, which is not what summarise wants. Since your last comment, I have been looking for a way to do the summation of the calculated values inside the group, rather than the selection outside, without success. Wrapping in sum() just sums individual lines. – andrewH Dec 08 '17 at 21:43