I'm working with the ENIGH database (the Spanish acronym for "National Survey of Household Income and Expenditure"), a survey conducted by the Mexican government. Like most surveys of its kind, it uses sampling weights.

I'm trying to calculate the mean, maximum and minimum household income by decile. In other words, after grouping households by income, what is the income of each 10% of households? To be honest, I haven't gotten very far, but this is what I have so far:

  1. I need my svydesign object
  2. Convert that into a table using svytable
  3. Arrange using desc() on my income variable

library(survey)
ENIGH_design <- svydesign(id = ~upm, strata = ~est_dis, weights = ~factor_hog, data = ENIGH)
ENIGH_table <- svytable(~ing_cor, ENIGH_design)

Here is where it gets tricky. Supposing I have 100 rows, I can't just take the first 10 of them, because once the weights are taken into account they might represent 9% or 20% (I'm just throwing out numbers) of the actual population.

I could use cut() on my income variable, but that would ignore the weights, and the results would only be representative of the sample, not of the total population.

I think that the best approach would be to use a combination of:

  • mutate() to create a new variable
  • if() in conjunction with mutate() to define which decile each row falls into
  • group_by() and mean() to calculate what I'm aiming for

This way I would have an extra variable that I could use to calculate whatever I want for any other variable. But again, I haven't defined my groups yet, so it's pretty much useless.

Thank you for reading. Thank you for your help.

Database available: https://www.inegi.org.mx/programas/enigh/nc/2016/default.html#Datos_abiertos

Here is a glimpse of how my DB looks:

folioviv    foliohog    ubica_geo   est_dis  upm  factor    ing_cor
100587003      1        10010000       2     610    180     22,723
100587004      1        10010000       2     610    180     17,920
100587005      1        10010000       2     610    180     27,506
100587006      1        10010000       2     610    180     56,236
100605201      1        10010000       2     620    178     41,587
100605202      1        10010000       2     620    178     135,437
100605203      1        10010000       2     620    178     62,386
100605205      1        10010000       2     620    178     103,502
100605206      1        10010000       2     620    178     27,323
100606301      1        10010000       3     630    223     68,042
100606302      1        10010000       3     630    223     98,537
100606305      1        10010000       3     630    223     53,237
100606306      1        10010000       3     630    223     132,861
100609801      1        10010000       3     640    232     190,033
100609802      1        10010000       3     640    232     28,654
100609805      1        10010000       3     640    232     74,408
100631401      1        10010000       1     650    171     80,761
100711503      1        10010000       1     770    184     38,640
100711504      1        10010000       1     770    184     81,672

There are many more columns, but they aren't necessary for this exercise.

René Martínez
  • Please provide a [reproducible example in r](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). The link I provided will tell you how. – M-- May 06 '19 at 20:39
  • OK, I did what I could. I don't have an actual example because I don't know how to do what I want, but I did provide the database link and my svydesign code. – René Martínez May 06 '19 at 20:54
  • The download link is broken. Could you instead use example data, e.g. one of the sets that ships with `survey` so folks don't need to download anything aside from that package? – camille May 06 '19 at 21:25
  • Could you use `svyquantile` to figure out the decile breaks, then use those as breaks for calling `cut` on income? Then you'll have income brackets to group by to take means. The `srvyr` package has some `dplyr` verbs for `survey` – camille May 06 '19 at 21:30
  • @camille I added a glimpse of my DB. – René Martínez May 06 '19 at 21:48
  • @camille you mean, arrange in descending order based on income (ing_cor), then apply svyquantile to the weight (factor) column? Wouldn't svyquantile order my weight column from min to max? If I were to use it on income (ing_cor), then what I described in my post would happen: I would take the first 25 of 100 rows, but judging by their weights, they might represent 9%, 20% or 27% of the actual total population. – René Martínez May 06 '19 at 21:58
  • Possible duplicate of [Compute quantiles incorporating Sample Design (Survey package)](https://stackoverflow.com/questions/32167390/compute-quantiles-incorporating-sample-design-survey-package) – Anthony Damico May 06 '19 at 23:25
  • @AnthonyDamico in your answer, `~api00` is not the weighting column, am I right? Because in `svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)` the weighting column is `~pw`. – René Martínez May 07 '19 at 15:19
  • hi, `pw` is the weight. mean of `api00` by quantile would be `svyby( ~ api00 , ~ qtile , dclus1 , svymean )` – Anthony Damico May 07 '19 at 15:56
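
The two suggestions in the comments above (`svyquantile` for the decile breaks, then `svyby` for the means) could be combined roughly as follows. This is only a sketch, assuming `survey` version 4.1 or later (where `coef()` works on the `svyquantile` result) and using the `apiclus1` data that ships with the package as a stand-in for ENIGH. The names `breaks`, `qtile`, `by_decile` and `ranges` are illustrative, not part of any API.

```r
library(survey)

# Example data shipped with the survey package (stand-in for ENIGH)
data(api)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)

# Weighted decile breaks of api00; coef() here assumes survey >= 4.1
breaks <- as.numeric(coef(svyquantile(~api00, dclus1,
                                      quantiles = seq(0.1, 0.9, by = 0.1))))

# Assign each row to a weighted decile group; unique() guards against
# tied breaks, in which case there are fewer than 10 groups
dclus1 <- update(dclus1,
                 qtile = cut(api00, breaks = c(-Inf, unique(breaks), Inf),
                             labels = FALSE))

# Weighted mean per group (uses the design), plus observed min and max
by_decile <- svyby(~api00, ~qtile, dclus1, svymean)
ranges <- aggregate(api00 ~ qtile, data = dclus1$variables,
                    FUN = function(x) c(min = min(x), max = max(x)))
```

For ENIGH, the same pattern would presumably apply with `ing_cor` in place of `api00` and `ENIGH_design` in place of `dclus1`. The breaks are computed on the income variable, and the weights enter through the design object rather than being passed directly.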

1 Answer

Make a table (dataframe or data.table or tibble) that looks like this:

> dt
folioviv    factor    ing_tri
       1       247      30000
       2       200      15000
       3       150      50000
incomes <- rep(dt$ing_tri, times = dt$factor)
deciles <- quantile(incomes, probs = seq(0.1, 1, by = 0.1), names = TRUE)

If I were you, I would try names = FALSE to make it easier to manipulate. Otherwise, the result will be a named numeric vector, which is a bit annoying.

Oh, and in case you want to compute the mean, just do mean(incomes).

PS: The column folioviv is not actually necessary, but you may want to put it there just in case.
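
As a rough check of the replication approach, here is a sketch on the three-row example table, together with an equivalent calculation from cumulative weight shares that avoids expanding the rows. The names `probs`, `ord`, `cum_share` and `w_deciles` are illustrative. Note that this only reproduces point estimates; it does not use the strata or PSU information that the `survey` design carries.

```r
# The three-row example table from above
dt <- data.frame(folioviv = 1:3,
                 factor   = c(247, 200, 150),
                 ing_tri  = c(30000, 15000, 50000))
probs <- seq(0.1, 1, by = 0.1)

# Replication approach: repeat each income `factor` times
incomes <- rep(dt$ing_tri, times = dt$factor)
deciles <- quantile(incomes, probs = probs, names = FALSE)

# Same deciles without expanding rows: sort by income, accumulate weight
# shares, and take the first income whose cumulative share reaches p
ord <- order(dt$ing_tri)
cum_share <- cumsum(dt$factor[ord]) / sum(dt$factor)
w_deciles <- sapply(probs, function(p) dt$ing_tri[ord][which(cum_share >= p)[1]])

# mean(incomes) is the weighted mean of the original rows
mean_rep <- mean(incomes)
mean_w   <- weighted.mean(dt$ing_tri, dt$factor)
```

On this small table both routes give the same deciles. In general, `quantile()` interpolates by default, so the two can differ slightly when a decile falls between two distinct incomes.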

Arturo Sbr
  • The OP is working with a package (`survey`) that has specific handling for survey data—weights, stratification, etc. Your answer works for a small, non-survey project, but doesn't deal with those things the OP needs. – camille May 06 '19 at 21:24
  • @camille I did **exactly** the same thing with ENIGH for my undergrad thesis. I find `data.table` to be more than capable of handling ENIGH. – Arturo Sbr May 06 '19 at 21:29
  • That's fair, but then how do you handle the ID and strata that the OP has in their survey design? – camille May 06 '19 at 21:37
  • You keep `deciles` as a separate object and continue working with it as parameters to slice your svytable. – Arturo Sbr May 06 '19 at 21:40
  • @ArturoSbr so, your suggestion is to replicate each row `factor` times; in this case I would have row 1 247 times, row 2 200 times, and so on. Then you compute the quantiles. Are `quantile()` and `probs()` from `data.table`? – René Martínez May 06 '19 at 22:55
  • @RenéMartínez Almost. The quantiles are calculated from `incomes`, which is the vector with the replicated incomes. The `probs` argument is just a sequence: `0.1, 0.2, ..., 0.9, 1`. – Arturo Sbr May 06 '19 at 23:09