Problems when importing factor variables from Stata using readstata13 package

Question

I have a very odd problem. I'm importing some factor variables from Stata into R using the readstata13 package. The imported labels/levels look OK, but the values change when I remove the factor class. Here is the Stata description of the variable (here is the data for reproducibility):

This is an image

Notice some labels are missing (UPDATE: actually, they are not missing. Rather, they are filled with a space, an odd way the coder chose to flag a missing label). Notice also that value 13 has 7 observations.

So I import the data in R and check levels and frequency. All fine:

Image here

Then I remove the levels using as.integer() (or as.numeric()), but things get messed up, in particular values 11, 12 and 13. Notice that 11 now has 7 observations, rather than 13:

Image here
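For context, here is a minimal base R sketch of one way such a shift can arise. It uses made-up codes, not the asker's data, and it is only an assumption that this mechanism is what happened here: `as.integer()` on a factor returns the position of each value in `levels()`, not the original numeric code, so gaps in the codes disappear.

```r
# Hypothetical codes with gaps (no 3..10 and no 12), not the real J_Itm1 data
codes <- c(1L, 2L, 11L, 13L, 13L)

# factor() sorts the unique values and stores each one as a level
f <- factor(codes)

# as.integer() returns the internal level positions, so 11 becomes 3 and 13 becomes 4
idx <- as.integer(f)

# Going through the level labels recovers the original codes when the labels are numeric
orig <- as.integer(as.character(f))

print(idx)
print(orig)
```

The `as.character()` route only works when the level labels are themselves the numeric codes; with text labels it would produce `NA`.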

The problem remains, regardless of the read.dta13 options related to factors. I tried the second suggestion in this answer, using the following code, but it did not work (most likely because only two values have labels):

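library(readstata13)

# `data` is the data frame returned by read.dta13()
labname <- get.label.name(data, "J_Itm1")      # name of the label set attached to J_Itm1
labtab <- get.label(data, labname)             # label table for that label set
table(get.origin.codes(data$J_Itm1, labtab))   # map the factor back to the original Stata codes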

Any idea how to solve the problem?

It's better if you can add a sample of the data, enough to recreate the issue, directly in the post such as with `dput`. People don't necessarily trust third-party anonymous download sites — camille, Dec 23 '21 at 17:34
Don't read them in as factors in the first place, play with `convert.factors = TRUE, generate.factors = FALSE,` options, read the documentation for `?readstata13::read.dta13` more thoroughly. — jay.sf, Dec 23 '21 at 17:36
I thought whoever voted as dupe would at least provide a suggestion of why it might be a dupe. Feels like low effort reviewing to me. — luchonacho, Dec 23 '21 at 18:02
@jay.sf `convert.factors = FALSE` did the trick. I don't quite understand what the problem with the Stata factors was. It seems read.dta13 orders them differently when not all levels are defined. — luchonacho, Jan 06 '22 at 11:49
@luchonacho Great you found a solution! However, I can't reproduce your issue; I actually see no difference between factor and numeric: `lapply(transform(readstata13::read.dta13('~/Downloads/test.dta'), J_Itm1_num=as.numeric(J_Itm1)), table)` using readstata13 ‘0.10.0’ on R version 4.1.2 (2021-11-01). — jay.sf, Jan 06 '22 at 12:13
I agree that the proposed duplicate target does not seem to fully address the issue. I was unable to find a better duplicate, so I voted to reopen. It would be great if you can self-answer this question. For future reference, [you can @-notify](https://meta.stackexchange.com/questions/43019/how-do-comment-replies-work) any gold badge holder who has bindingly voted to close your question. — Ian Campbell, Jan 06 '22 at 15:08

score 0 · Accepted Answer · answered Jan 07 '22 at 10:41

It seems the problem is that the readstata13 package recreates the factor values in R without keeping the order they have in Stata.

The "solution" was to not import levels from Stata. This can be achieved using the convert.factors = FALSE option. Although not an optimal solution, it works for me because I do not need factor levels in the first place. I raised an issue on the package's website to see whether there are other solutions.
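A minimal sketch of that workaround, assuming the file is called `test.dta` (the name used in the comments) and sits in the working directory; the helper `read_codes()` is a hypothetical name introduced only for illustration:

```r
# Hypothetical helper: read a Stata file but keep the raw numeric codes,
# i.e. do not turn labelled variables into R factors
read_codes <- function(path) {
  readstata13::read.dta13(path, convert.factors = FALSE)
}

path <- "test.dta"  # assumed location of the data file

# Only run when the package and the file are actually available
if (requireNamespace("readstata13", quietly = TRUE) && file.exists(path)) {
  dat <- read_codes(path)
  # Frequencies now refer to the original Stata codes
  print(table(dat$J_Itm1))
}
```

The resulting frequencies can then be compared against the Stata tabulation to check that value 13 keeps its 7 observations.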
