I have very little programming experience, but I'm working on a statistics project and would like to generate an unequal probability sample where the inclusion probability of a unit is based on its size (PPS).

Basically, I have two datasets:

  • ds1 lists US states and the parameter I'm trying to estimate
  • ds2 has the population size of each state.

My questions:

  1. How can I use R to select a random sample from the first dataset, with inclusion probabilities based on the population of each state (from the second dataset)?

  2. Is there any way to use R to calculate these Generalized Unequal Probability Estimator formulas?

[Formula images: Generalized Unequal Probability Estimator; Estimated Variance of Generalized Unequal Probability Estimator]

A note on the formulas: pi_i is the inclusion probability and pi_ij is the joint inclusion probability.

smci
Jessica Wu

  • Links are ephemeral, so it's not OK for the text to rely on something that only appears in a linked image. State what it is: *"link to formulas for Generalized Unequal Probability Estimator"*. Anyway, what is `y` supposed to be: the independent variable, the population, or what? – smci Dec 03 '17 at 02:51
  • Hi, sorry I didn't know. I edited it to get rid of the link. Also y is the variable of interest. Thank you for your help! – Jessica Wu Dec 03 '17 at 03:00
  • No problem. Your second question should be asked as a separate question, and it is off-topic on SO, or at least won't get a great response. It's best to ask statistical questions at the sister site [CrossValidated](https://statistics.stackexchange.com) – smci Dec 03 '17 at 03:23

2 Answers

Yes, that's called weighted sampling. Simply set each state's weight to its size. Strictly speaking, you don't even need to normalize the weights by 1/sum(sizes), although it's always good practice to do so. There are tons of duplicate posts on SO showing how to do weighted sampling.

The only tiny complication is that you first need to join the two datasets, ds1 and ds2, on the state. Show us the code you've tried if it's causing problems. I recommend using either dplyr or data.table.
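As a rough sketch of the join-then-sample idea (not the asker's actual data): the column names `state`, `y` and `pop` and all the toy values below are made up for illustration, and base R's `merge()` stands in for a dplyr or data.table join so the example has no dependencies.

```r
# Hypothetical toy data; column names and values are assumptions.
ds1 <- data.frame(state = c("CA", "TX", "NY", "WY", "VT"),
                  y     = c(10.2, 8.7, 9.5, 3.1, 2.8))
ds2 <- data.frame(state = c("TX", "CA", "VT", "NY", "WY"),
                  pop   = c(30, 40, 1, 20, 1))

# Join the two datasets on the state name (an inner join).
ds <- merge(ds1, ds2, by = "state")

# Normalized selection weights: each state's share of the total size.
ds$p <- ds$pop / sum(ds$pop)

# Draw n distinct states, with selection weights proportional to size.
set.seed(1)
n   <- 3
idx <- sample(nrow(ds), size = n, replace = FALSE, prob = ds$p)
smp <- ds[idx, ]
print(smp)
```

One caveat: with `replace = FALSE`, `sample()` draws units one at a time and renormalizes the remaining weights, so the resulting inclusion probabilities are only approximately proportional to size. With `replace = TRUE`, each draw selects state i with probability exactly `p[i]`.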

Your second question should be asked as a separate question, and it is off-topic on SO, or at least won't get a great response. It's best to ask statistical questions at the sister site CrossValidated.
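For reference, once the inclusion probabilities are known, the arithmetic itself is only a few lines of R. The sketch below is an assumption on my part, since the formulas in the question were images: it assumes the usual textbook form, i.e. the estimate is sum(y_i/pi_i) / sum(1/pi_i), with a variance estimate built from the residuals y_i - estimate, pi_i and pi_ij. The function name `gupe` and its arguments are hypothetical, and the joint inclusion probabilities have to be supplied by you because they depend on the sampling design.

```r
# Hypothetical helper; the formula form is an assumption (see the text above).
# y     : observed values for the sampled units
# pi_i  : inclusion probability of each sampled unit
# pi_ij : matrix of joint inclusion probabilities (diagonal is ignored)
# N     : population size; defaults to the estimate sum(1 / pi_i)
gupe <- function(y, pi_i, pi_ij, N = sum(1 / pi_i)) {
  mu_hat <- sum(y / pi_i) / sum(1 / pi_i)
  d <- y - mu_hat  # residuals around the estimate

  # Sum over i of (1 - pi_i) / pi_i^2 * d_i^2
  term1 <- sum((1 - pi_i) / pi_i^2 * d^2)

  # Sum over i != j of (pi_ij - pi_i pi_j) / (pi_i pi_j) * d_i d_j / pi_ij
  pp    <- outer(pi_i, pi_i)
  cross <- (pi_ij - pp) / pp * outer(d, d) / pi_ij
  diag(cross) <- 0
  term2 <- sum(cross)

  list(estimate = mu_hat, variance = (term1 + term2) / N^2)
}

# Sanity check with simple random sampling of n = 2 units from N = 4:
# pi_i = n / N = 0.5 and pi_ij = n (n - 1) / (N (N - 1)) = 1/6.
res <- gupe(y     = c(2, 4),
            pi_i  = c(0.5, 0.5),
            pi_ij = matrix(c(0.5, 1/6, 1/6, 0.5), nrow = 2),
            N     = 4)
print(res)
```

In this simple random sampling check the estimate is the sample mean, 3, and the variance is 0.5, which agrees with the familiar (1 - n/N) s^2 / n. Check the formula against your own source before relying on it.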

smci

There is an R package for exactly this, pps, which comes with its own documentation.

There is also another package, survey, which has some documentation as well.

I'm not sure what the difference between the two is, and I haven't used either of them myself. I hope this is what you're looking for.

bsrcube