I have some survey data. As an example, I use the Credit data from the ISLR package.

library(ISLR)

The distribution of Gender in the data looks like this

prop.table(table(Credit$Gender))
  Male Female 
0.4825 0.5175 

and the distribution of Student looks like this.

prop.table(table(Credit$Student))
 No Yes 
0.9 0.1  

Let's say that, in the population, the actual distribution of Gender is Male/Female (0.35/0.65) and the distribution of Student is Yes/No (0.2/0.8).

In SPSS it's possible to weight the sample by dividing the population distribution by the sample distribution, so that the weighted sample simulates the distribution of the population. This process is called "RIM weighting". The data will only be analyzed with crosstables (i.e. no regression, t-test, etc.). What is a good method in R to weight a sample so that the data can be analyzed with crosstables later on?
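To make the "population share divided by sample share" idea concrete, here is a minimal base-R sketch of raking (iterative proportional fitting). It is only an illustration under assumptions: `rake()` and `weighted_share()` are hypothetical helpers, not functions from any package, and the data frame is a synthetic stand-in whose cell counts are those of the unweighted Gender by Student table reported in this question.

```r
# Synthetic stand-in for the survey (400 rows). Cell counts are taken from the
# unweighted Gender x Student table in the question; this is not the ISLR data.
cells <- c(177, 16, 183, 24)
survey <- data.frame(
  Gender  = rep(c("Male", "Male", "Female", "Female"), times = cells),
  Student = rep(c("No", "Yes", "No", "Yes"), times = cells)
)

# Population targets from the question
targets <- list(
  Gender  = c(Male = 0.35, Female = 0.65),
  Student = c(Yes = 0.20, No = 0.80)
)

# Weighted share of each level of x under weights w (hypothetical helper)
weighted_share <- function(w, x) sapply(split(w, x), sum) / sum(w)

# One-variable case: weight factor = population share / sample share
sample_gender <- weighted_share(rep(1, nrow(survey)), survey$Gender)
gender_factor <- targets$Gender[names(sample_gender)] / sample_gender

# Several variables: adjust one margin at a time and repeat until all fit
rake <- function(data, targets, max_iter = 50, tol = 1e-8) {
  w <- rep(1, nrow(data))
  for (i in seq_len(max_iter)) {
    for (v in names(targets)) {
      current <- weighted_share(w, data[[v]])
      adj <- targets[[v]][names(current)] / current
      w <- w * adj[as.character(data[[v]])]
    }
    # Largest distance between any weighted margin and its target
    gap <- max(sapply(names(targets), function(v) {
      current <- weighted_share(w, data[[v]])
      max(abs(current - targets[[v]][names(current)]))
    }))
    if (gap < tol) break
  }
  unname(w * nrow(data) / sum(w))  # rescale so the weights sum to n
}

survey$weight <- rake(survey, targets)
weighted_share(survey$weight, survey$Gender)
weighted_share(survey$weight, survey$Student)
```

With a single variable the factor alone is enough. With two or more, fixing one margin disturbs the other, which is why the loop repeats until both margins sit on their targets.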

It is possible to calculate the RIM weights in R with the iterake package.

install.packages("devtools")
devtools::install_github("ttrodrigz/iterake")
library(iterake)

credit_uni = universe(df = Credit,
    category(
        name = "Gender",
        buckets = c(" Male", "Female"),
        targets = c(.35, .65)),
    category(
        name = "Student",
        buckets = c("Yes", "No"),
        targets = c(.2, .8)))

credit_weighted = iterake(Credit, credit_uni)



-- iterake summary -------------------------------------------------------------
 Convergence: Success
  Iterations: 5

Unweighted N: 400.00
 Effective N: 339.58
  Weighted N: 400.00
  Efficiency: 84.9%
        Loss: 0.178
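The Effective N, Efficiency, and Loss figures in this summary are numerically consistent with Kish's effective sample size: 339.58 / 400 ≈ 84.9% and 400 / 339.58 - 1 ≈ 0.178. The sketch below shows that calculation. That iterake computes them exactly this way is my assumption, not something confirmed from its source, and the weights used here are made up for illustration.

```r
# Hypothetical weights for illustration; in practice this would be the
# weight column of the weighted data frame.
w <- c(0.6, 0.6, 0.8, 1.0, 1.0, 1.2, 1.4, 1.4)

n_unweighted <- length(w)               # number of respondents
n_weighted   <- sum(w)                  # weighted base
n_effective  <- sum(w)^2 / sum(w^2)     # Kish's approximation
efficiency   <- n_effective / n_unweighted
loss         <- n_unweighted / n_effective - 1

round(c(unweighted = n_unweighted, effective = n_effective,
        weighted = n_weighted, efficiency = efficiency, loss = loss), 3)
```

The more the weights vary, the smaller the effective N becomes, so the loss is a rough measure of the precision given up in exchange for matching the population margins.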

Here is the SPSS output (crosstables) of the weighted data

                Student     
                No  Yes 
Gender  Male    117 23  140
        Female  203 57  260
                320 80  400

and here is the output for the unweighted data. (I exported both files and did the calculation in SPSS, weighting the weighted sample by the calculated weights.)

                Student     
                No  Yes 
Gender   Male   177 16  193
         Female 183 24  207
                360 40  400

In the weighted data set, I have the desired distribution Student: Yes/No(0.2/0.8) and Gender male/female(0.35/0.65).

Here is another example using SPSS of Gender and Married (weighted)

                Married
                No  Yes 
Gender   Male   57  83  140
         Female 102 158 260
                159 241 400

and unweighted.

                Married 
                No  Yes 
Gender   Male   76  117 193
         Female 79  128 207
                155 245 400

This doesn't work in R (i.e. both crosstables look like the unweighted one).

library(expss)

cro(Credit$Gender, Credit$Married)

cro(credit_weighted$Gender, credit_weighted$Married)



 |               |              | Credit$Married |     |
 |               |              |             No | Yes |
 | ------------- | ------------ | -------------- | --- |
 | Credit$Gender |         Male |             76 | 117 |
 |               |       Female |             79 | 128 |
 |               | #Total cases |            155 | 245 |

 |                        |              | credit_weighted$Married |     |
 |                        |              |                      No | Yes |
 | ---------------------- | ------------ | ----------------------- | --- |
 | credit_weighted$Gender |         Male |                      76 | 117 |
 |                        |       Female |                      79 | 128 |
 |                        | #Total cases |                     155 | 245 |
Gregory Demin
Banjo
  • Which result do you expect? – Christoph Aug 18 '19 at 15:16
  • You might have some luck looking on [stats.se], especially for the underlying calculations – camille Aug 18 '19 at 16:14
  • The question is a hybrid (coding and theory). It's about what is feasible in R, not so much about what the best theoretical solution is. So I thought it was a good question for Stack Overflow. – Banjo Aug 18 '19 at 16:44
  • I agree it's a hybrid and don't think it's off topic here, just that you might find helpful discussions there as well. Especially if there isn't a predefined function in R, the stats site might be good for figuring out the math behind rolling your own function – camille Aug 18 '19 at 16:52
  • @Banjo if you run compare_margins( df = credit_weighted, weight = weight, universe = credit_uni, plot = TRUE ) %>% select(-contains("uwgt")) you actually do get the weighted data (which seems to match your SPSS output). Just take a look at the tibble it outputs? I don't think it gives the entire cross table though? – Dunois Aug 18 '19 at 17:00
  • Yes, but this works only for the variables I actually weighted. My original data has 200 variables. I also want to create crosstables with other variables (e.g. number of Cards, Education or Married in the credit data set). For those variables, I can't calculate (weighted) crosstables. – Banjo Aug 18 '19 at 18:02
  • @Banjo, I don't quite follow. By "I can't calculate (weighted) crosstables." do you mean `iterake` doesn't produce a result, or are you implying that you don't want to run all those variable combinations manually? – Dunois Aug 18 '19 at 19:30
  • See my edit: look at the last two crosstables calculated by SPSS. For the first one, I only weighted Gender, but the number of married males and females changes because of the simulated replication of Gender. If I calculate the same crosstable in R, the number of married males and females stays the same. The weighting doesn't carry over to the whole data frame. – Banjo Aug 18 '19 at 21:31

1 Answer

With the expss package you need to provide the weight variable explicitly. As far as I understand, iterake adds a special variable named weight to the dataset:

library(expss)

cro(Credit$Gender, Credit$Married) # unweighted result

cro(credit_weighted$Gender, credit_weighted$Married, weight = credit_weighted$weight) # weighted result
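The same principle applies outside expss: a crosstable is only weighted if the weights are passed to the tabulating function. As a sketch in base R, `xtabs()` sums a weight column per cell when it appears on the left-hand side of the formula. The toy data frame and its `weight` values below are made up; they only stand in for credit_weighted.

```r
# Toy data with a hypothetical weight column standing in for credit_weighted
df <- data.frame(
  Gender  = c("Male", "Male", "Female", "Female", "Female", "Male"),
  Married = c("Yes", "No", "Yes", "Yes", "No", "Yes"),
  weight  = c(0.7, 0.7, 1.3, 1.3, 1.3, 0.7)
)

# Unweighted: every row counts as 1, whatever the weight column says
unweighted <- table(df$Gender, df$Married)

# Weighted: each cell is the sum of the weights of the rows falling in it
weighted <- xtabs(weight ~ Gender + Married, data = df)

addmargins(unweighted)
round(addmargins(weighted), 1)
```

Because the weight is attached to each row, it carries over to any pair of variables in the data frame, not just the ones used to compute the weights.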
Gregory Demin