
This is similar to, but not the same as, "Using weights in R to consider the inverse of sampling probability".

I have a long data frame and this is a part of the real data:

age gender labour_situation industry_code FACT FACT_2....
35  M      unemployed       15            1510
21  F      inactive         00            651

FACT is a variable that means, for the first row, that a male unemployed individual of 35 years represents 1510 individuals of the population.

I need to produce some tables showing relevant information such as the percentage of employed and unemployed people. In Stata, `tab labour_situation [w=FACT]` shows the number of employed and unemployed people in the population, while `tab labour_situation` shows the number of employed and unemployed people in the sample.

A partial solution could be to repeat the 1st row of the data frame 1510 times, the 2nd row 651 times, and so on. From what I've found, one option is to run

longdata <- data[rep(1:nrow(data), data$FACT), ]
employment_table = with(longdata, addmargins(table(labour_situation, useNA = "ifany")))
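As a sketch of an alternative that avoids building the expanded data frame: base R's `xtabs()` sums a weight column within each category, which should give the same population counts as the row-repetition approach. The toy rows below are assumptions for illustration (the first two mirror the sample above, the third is invented), and `sample_counts`, `weighted_counts` and `weighted_pct` are illustrative names.

```r
# Toy data: the first two rows mirror the sample above; the third is invented.
data <- data.frame(
  age              = c(35, 21, 40),
  gender           = c("M", "F", "F"),
  labour_situation = c("unemployed", "inactive", "employed"),
  FACT             = c(1510, 651, 1200)
)

# Unweighted counts: people in the sample.
sample_counts <- table(data$labour_situation, useNA = "ifany")

# Weighted counts: sum of FACT within each category (people in the population).
weighted_counts <- xtabs(FACT ~ labour_situation, data = data)

# Weighted percentages, rounded to one decimal place.
weighted_pct <- round(100 * prop.table(weighted_counts), 1)

print(addmargins(weighted_counts))
print(weighted_pct)
```

Because the weights are summed rather than used as repetition counts, this also works when `FACT` is not an integer.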

The other thing I need to do is run a regression, keeping in mind that cluster sampling was used: the population was divided into regions. This creates a problem: one individual interviewed in foo+bar represents foo+bar people while an individual interviewed in foo+bar represents foo+bar people, but foo+bar and foo+bar are not in proportion to the total population of each region, so some regions will be overrepresented and others underrepresented. To take this into account, each observation should be weighted by the inverse of its probability of being sampled.

The last paragraph means that the model foo+bar can be estimated with valid equations foo+bar BUT the variance-covariance matrix won't be foo+bar but foo+bar if I consider the inverse of sampling probability.

In Stata it is possible to run a regression with `reg y x1 x2 [pweight=n]`, which calculates the right variance-covariance matrix considering the inverse of sampling probability. At the moment I have to use Stata for some parts of my work and R for others. I'd like to use just R.
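For illustration, here is a minimal base-R sketch of what a probability-weighted regression is usually understood to compute: weighted least squares for the coefficients, plus a heteroskedasticity-robust (sandwich) variance with an n/(n-k) correction. The data are simulated, and the names `d`, `pw` and `V_robust` are hypothetical. The sketch treats observations as independent, so it does not account for clustering or stratification; for design-based standard errors, the survey package (`svydesign()` and `svyglm()`) is the more usual route.

```r
set.seed(1)

# Simulated data; 'pw' plays the role of the inverse sampling probability.
n_obs <- 200
d <- data.frame(x1 = rnorm(n_obs), x2 = rnorm(n_obs), pw = runif(n_obs, 50, 500))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n_obs)

# Point estimates: weighted least squares, beta = (X'WX)^{-1} X'Wy.
fit <- lm(y ~ x1 + x2, data = d, weights = pw)
X   <- model.matrix(fit)
e   <- d$y - drop(X %*% coef(fit))  # raw (unweighted) residuals
k   <- ncol(X)

# Sandwich variance: bread = (X'WX)^{-1}, meat = sum_i w_i^2 e_i^2 x_i x_i'.
bread    <- solve(crossprod(X, d$pw * X))
meat     <- crossprod(d$pw * e * X)
V_robust <- n_obs / (n_obs - k) * bread %*% meat %*% bread

# Coefficients alongside their robust standard errors.
robust_se <- sqrt(diag(V_robust))
print(cbind(estimate = coef(fit), robust_se = robust_se))
```

The coefficients match plain `lm(..., weights = pw)`; only the standard errors differ from what `summary(fit)` reports, because `lm()` treats the weights as precision weights rather than sampling weights.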

pachadotdev
  • Try `DF[rep(1:nrow(DF), DF$FACT), ]` – Frank Jun 13 '16 at 19:55
  • The last comment is likely to irritate everyone, Stata users because of the dig at Stata and R users because of your implication that R can't do something. More seriously for everyone, I am not clear how any of the answers details how their solution works for non-integers, but I am no kind of R expert. – Nick Cox Jun 14 '16 at 23:42

1 Answer


You can do this by repeating the rownames:

df1 <- df[rep(row.names(df), df$FACT), 1:5]

> head(df1)
    age gender labour_situation industry_code FACT
1    35      M       unemployed            15 1510
1.1  35      M       unemployed            15 1510
1.2  35      M       unemployed            15 1510
1.3  35      M       unemployed            15 1510
1.4  35      M       unemployed            15 1510
1.5  35      M       unemployed            15 1510
> tail(df1)
      age gender labour_situation industry_code FACT
2.781  21      F         inactive             0  787
2.782  21      F         inactive             0  787
2.783  21      F         inactive             0  787
2.784  21      F         inactive             0  787
2.785  21      F         inactive             0  787
2.786  21      F         inactive             0  787

Here `1:5` refers to the columns to keep. If you leave that index blank, all columns will be returned.
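One caveat, sketched below with made-up numbers: `rep()` truncates non-integer `times` toward zero, so row expansion silently drops the fractional part of each weight. Summing the weights directly does not have that problem. `df_frac`, `expanded_counts` and `weighted_sums` are hypothetical names used only for this illustration.

```r
# Hypothetical data with non-integer expansion factors.
df_frac <- data.frame(
  labour_situation = c("unemployed", "inactive"),
  FACT             = c(2.9, 1.5)
)

# Row expansion: rep() truncates 2.9 to 2 and 1.5 to 1.
expanded <- df_frac[rep(row.names(df_frac), df_frac$FACT), ]
expanded_counts <- table(expanded$labour_situation)

# Summing the weights per category keeps the fractional part.
weighted_sums <- tapply(df_frac$FACT, df_frac$labour_situation, sum)

print(expanded_counts)
print(weighted_sums)
```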

jalapic
  • Slow reply: yes, this is a partial solution and it solves an important part of what I asked. When I have time, maybe I'll work on a package that does Stata-like things. – pachadotdev Aug 13 '16 at 18:05