We are using the ESS data set but are unsure how to deal with missing values in SAS Enterprise Guide. Our dependent variable is "subjective wellbeing", and we aim to include a large number of control variables; as a result, our data set contains a lot of missing values.

We do not want to use list-wise deletion. Instead, we would like to treat the different kinds of missing values differently depending on the respondent's response: "no answer", "not applicable", "refusal", "don't know". For example, we plan to use pair-wise deletion for "not applicable", while for some other responses we might use e.g. the mean value, depending on the question (under the assumption that the type of response provides information about whether the data are MCAR, MAR or NMAR).

Our main questions are:

  • Currently, missing values are coded in different ways in the data set (99, 77, 999, 88 etc.). Should we replace these values in Excel before proceeding in SAS Enterprise Guide? If yes, how should we best replace them, given that they are supposed to be treated in different ways?
  • How do we tell SAS Enterprise Guide to treat different missings in different ways?
  • If we use dummy variables to mark refusals for e.g. income, how do we include these in the final regression?

We have tried to read about this but are a bit confused, so we would really appreciate any help :)

alexwhitworth
  • Yes you should replace your missing before modelling - but do it in EG not Excel. You can trace your changes this way and if you run a model and change your mind on how to deal with a particular missing case it's easier to fix. Imputation is the term for filling in missing values - I'm not sure how exactly EG accomplishes this. This question is also better posted on CrossValidated as it relates more to statistical methodology than programming. – Reeza Mar 24 '16 at 11:30
  • Welcome to SO. Please provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)... At a bare minimum, (1) there is no explanation of the "ESS data set" and (2) your desired treatment of missing values is thoroughly unclear. Please clarify both. – alexwhitworth Mar 30 '16 at 22:10

1 Answer

On a technical note, SAS offers special missing values: .a, .b, .c etc. (not case sensitive). Replace the numeric codes in SAS, e.g. 99 = .a, 77 = .b. Decision trees, for example, will be able to handle these as separate values.
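As a minimal sketch of that recoding, a DATA step along the following lines could be run in a program node in Enterprise Guide. The data set names (`ess_raw`, `ess_clean`) and the variable `income` are hypothetical, and the mapping of 77/88/99 to refusal/don't know/no answer is an assumption that should be checked against the codebook for each variable.

```sas
/* Sketch only: ess_raw, ess_clean and income are hypothetical names.  */
/* The meaning assigned to each numeric code is an assumption; check   */
/* the codebook, since codes can differ between variables.             */
data ess_clean;
    set ess_raw;
    select (income);
        when (77) income = .r;   /* assumed: refusal    */
        when (88) income = .d;   /* assumed: don't know */
        when (99) income = .n;   /* assumed: no answer  */
        otherwise;               /* valid answers are left unchanged */
    end;
run;

/* The MISSING option lists each special missing value as its own row */
proc freq data=ess_clean;
    tables income / missing;
run;
```

Because `.r`, `.d` and `.n` are all missing values, procedures exclude them from calculations by default, but the letter still records why the value is missing, so a later step can treat each kind differently.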

To keep the information of the missing observations in a regression model you will have to make some kind of tradeoff (find the least harmful solution to your problem).

  • One classical solution is to create dummy variables and replace the missing values with the mean. Include both the dummies and the original variables in the model. Possible problems: The coefficients will be biased, multicollinearity, too many categories/variables.

  • Another approach would be to bin your variables into categories. Binning purely by value (e.g. deciles) may lose information; binning by theory may introduce confirmation bias.

  • A more advanced approach would be to calculate the information value (http://support.sas.com/resources/papers/proceedings13/095-2013.pdf) of your independent variables, thereby replacing all values, including the missing ones. Of course this will again lead to bias and loss of information, but it might at least be a good step towards identifying useful/useless missing values.
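The first option above (dummy variables plus mean replacement) could be sketched roughly as follows. This is an illustration under assumptions, not a recommended specification: the data are made up, the variable names (`swb`, `income`, `inc_refused`, `inc_dontknow`) are hypothetical, the codes 77 and 88 are assumed to mean refusal and don't know, and PROC STDIZE and PROC REG require SAS/STAT to be licensed.

```sas
/* Made-up demo data: swb = subjective wellbeing, income = coded income. */
/* 77 and 88 are assumed to be the refusal and don't-know codes.         */
data demo;
    input swb income;
    inc_refused  = (income = 77);              /* 1 if refusal, else 0    */
    inc_dontknow = (income = 88);              /* 1 if don't know, else 0 */
    if income in (77, 88) then income = .;     /* set the codes to missing */
    datalines;
7 5
6 77
8 3
5 88
9 8
4 2
6 77
7 6
5 88
8 9
;
run;

/* REPONLY replaces missing values with the mean without standardizing */
proc stdize data=demo out=demo_imputed method=mean reponly;
    var income;
run;

/* Include both the imputed variable and the missingness dummies */
proc reg data=demo_imputed;
    model swb = income inc_refused inc_dontknow;
run;
quit;
```

Each dummy coefficient then estimates how wellbeing differs for that group relative to what the mean income would predict, and the caveats listed in the first bullet (bias, multicollinearity, many extra variables) still apply.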

Jetzler