Compute quantiles incorporating Sample Design (Survey package)

Question

I want to compute a new column using the quantiles of another column (a continuous variable) incorporating the Sample Design of a complex survey. The idea is to create in the the data frame a new variable that indicates which quantile group each observation falls into

Here is how I execute the idea without incorporating the sample design, so you can understand what I'm aiming for.

# Load Data
  data(api)


# Convert data to data.table format (mostly to increase speed of the process)
  apiclus1 <- as.data.table(apiclus1)

# Create deciles variable
apiclus1[, decile:=cut(api00,
                       breaks=quantile(api00,
                                       probs=seq(0, 1, by=0.1), na.rm=T),
                       include.lowest= TRUE, labels=1:10)]

I've tried using svyquantile from the survey package, but I couldn't get my head around this problem. This code does not return the quantile groups as an output that I can feed into a new variable. Any thoughts on this?

# Load Package
 library(survey)

# create survey design
 dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)

# What I've tried to do
  svyquantile(~api00, design = dclus1, quantiles = seq(0, 1, by=0.1), method = "linear", ties="rounded")

Anthony Damico · Accepted Answer · 2015-09-05T17:18:31.033

2

library(survey)

data(api)

dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)

a <- svyquantile(~api00, design = dclus1, quantiles = seq(0, 1, by=0.1), method = "linear", ties="rounded")

# use factor() and findInterval()
dclus1 <- update( dclus1 , qtile = factor( findInterval( api00 , a ) ) )

# distribution
svymean( ~ qtile , dclus1 )


# or without the one observation in group number 11
dclus1 <- update( dclus1 , qtile = factor( findInterval( api00 , a[ -length( a ) ] ) ) )

# distribution
svymean( ~ qtile , dclus1 )



# quantiles by group
b <- svyby(~api00, ~stype, design = dclus1, svyquantile, quantiles = seq(0, 0.9 , by=0.1) ,ci=T)

# copy over your data
x <- apiclus1

# stype of each record
match( x$stype , b$stype ) 

# create the new qtile variable
x$qtile_by_stype <- factor( diag( apply( data.frame( b )[ match( x$stype , b$stype ) , 2:11 ] , 1 , function( v , w ) findInterval( w , v ) , x$api00 ) ) )

# re-create the survey design
dclus1 <- svydesign(id=~dnum, weights=~pw, data=x, fpc=~fpc)

# confirm you have quantiles
svyby( ~ qtile_by_stype , ~ stype , dclus1 , svymean )

edited Sep 05 '15 at 17:18

answered Aug 24 '15 at 01:48

Anthony Damico

5,779
7
46
77

Thank @Anthony. Any idea on how to do this by subgroups? For the 1st part of extracting the quantiles I thought using this `b <- svyby(~api00, ~stype, design = dclus1, svyquantile, quantiles = seq(0, 1, by=0.1), method = "linear", ties="rounded", na.rm= T, ci=TRUE)` But I confess I have no clue on how to use this object `b` to update the values in the survey design or in the dataset itself – rafa.pereira Sep 05 '15 at 13:27
@RafaelPereira sort of a ugly but certainly possible.. note the `2:11` has been hardcoded. look at `data.frame(b)` to examine why those columns were selected for this specific example – Anthony Damico Sep 05 '15 at 17:19
in your code to calculate deciles by group , why did you write `seq(0, 0.9 , by=0.1)` , instead of `seq(0, 10 , by=0.1)` ? – rafa.pereira Jan 13 '17 at 11:46

AntoniosK · Answer 2 · 2015-08-23T20:38:44.107

The output from your whole code above is :

        0   0.1   0.2   0.3   0.4    0.5   0.6    0.7   0.8   0.9   1
api00 411 497.8 535.6 573.2 614.6 651.75 686.6 709.55 735.4 780.7 905

You can change the names to represent your groups. 0 and 1 represent minimum and maximum. 0.1 represents decile 1, 0.2 represents decile 2, etc. Something like:

dt_quantile = svyquantile(~api00, design = dclus1, quantiles = seq(0, 1, by=0.1), method = "linear", ties="rounded")
dt_quantile = data.table(dt_quantile)

setnames(dt_quantile, c("min",paste0("decile",1:10)))

dt_quantile = data.table(t(dt_quantile), keep.rownames = T)

dt_quantile 

#         rn     V1
# 1:      min 411.00
# 2:  decile1 497.80
# 3:  decile2 535.60
# 4:  decile3 573.20
# 5:  decile4 614.60
# 6:  decile5 651.75
# 7:  decile6 686.60
# 8:  decile7 709.55
# 9:  decile8 735.40
# 10: decile9 780.70
# 11: decile10 905.00

Am I missing your objective?

Thank you @AntoniosK, but the idea is actually to create in the the data frame a new variable that indicates which quantile group each observation falls into . — rafa.pereira, Aug 23 '15 at 21:33
You're right. I think it just computes the quantiles, without joining back the info of which quantile each row belongs to. That's what the function is for. However you have the info inside dclus1$variables, which you can use as a dataset and apply your method. — AntoniosK, Aug 23 '15 at 21:50

Compute quantiles incorporating Sample Design (Survey package)

2 Answers2

Linked