How to treat encoding when reading .dta-files into R from Stata-files prior to version 14?

Question

How can one dodge the encoding problems when reading Stata-data into R?

The dataset I wish to read is a .dta-file in either Stata 12 or Stata 13 format (from before Stata introduced support for UTF-8 in version 14). Text variables containing the Swedish and German letters å, ä, ö, ß, as well as other non-ASCII characters, do not import well.

I have tried these answers, read.dta in foreign, the haven package (with no encoding parameters), and now read.dta13 from readstata13, whose documentation informs me that it expects Stata files to be encoded in CP1252. But alas, the encoding doesn't work. Should I give up and use a .csv-export as a bridge instead, or is it actually possible to read .dta-files in R?

Minimal example:
This code downloads the first few lines of my dataset and illustrates the problem, for example in the variable vocation, which contains text in Scandinavian languages.

setwd("~/Downloads/")
# Download the example file (the first few rows of the dataset)
system("curl -O http://www.lilljegren.com/stackoverflow/example.stata13.dta", intern = FALSE)

# Attempt 1: haven (read_dta() lives in haven, not in foreign)
library(haven)
df1 <- read_dta('example.stata13.dta', encoding = "latin1")
df2 <- read_dta('example.stata13.dta', encoding = "CP1252")

# Attempt 2: readstata13, with different source encodings
library(readstata13)
df3 <- read.dta13('example.stata13.dta', fromEncoding = "latin1")
df4 <- read.dta13('example.stata13.dta', fromEncoding = "CP1252")
df5 <- read.dta13('example.stata13.dta', fromEncoding = "utf-8")

vocation <- c("Brandkorpral","Sömmerska","Jungfru","Timmerman","Skomakare","Skräddare","Föreståndare","Platsförsäljare","Sömmerska")
df4$vocation == vocation
# [1]  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE

`csv` is probably the best thing to do. Or if you have Stata 14 convert the files to Unicode first and save. — , Nov 06 '18 at 16:02
This is what I'm fearing. I'm looking at different files Stata builds using `enca`, but it is not able to guess what encoding they are, and I also have some encoding problems reading the csv-files that Stata generates. Uhhh. Stata really isn't awesome :/ 21st century software without support for utf-8 :( — nJGL, Nov 06 '18 at 17:20
Stata's current version is 15 and as of version 14 supports Unicode. Not sure why you are complaining for features that are not available in software that is two versions behind and no longer supported / maintained. Upgrade? — , Nov 06 '18 at 17:49
I am poor, and Stata is licensed software; an upgrade would be expensive, and I'd need it merely to resolve this encoding problem that, I think one could argue, shouldn't have to belong to our decade. But duly noted: I was grumpy. :) Besides, the correct encoding was `"macroman"`, and I found out by going through the `csv`-solution, as you suggested, so thank you. — nJGL, Nov 07 '18 at 08:49

score 4 · Accepted Answer · edited Nov 07 '18 at 10:30

The correct encoding for reading files generated on a Mac by Stata versions prior to 14 is "macroman":

df <- read.dta13('example.stata13.dta', fromEncoding="macroman")

On my Mac, .dta-files in both Stata 13 and Stata 12 formats (the latter saved with saveold in Stata 13) imported nicely like this.

Supposedly, the readstata13 manual correctly assumes "CP1252" on other platforms. For me, however, "macroman" did the trick (also for the .csv-files that Stata 13 generated with export delimited).
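If it is unclear which encoding a file uses, one way to narrow it down is to decode a known value under several candidate encodings and see which one reproduces the expected text. The following is a minimal sketch in base R, not part of the original workflow: the bytes are hand-written to mimic how MacRoman stores "Sömmerska" (ö is byte 0x9A there), `try_decode` is a hypothetical helper, and "MACINTOSH" is included as an assumed alias because the names that `iconv()` accepts vary by platform (`iconvlist()` shows what is available locally).

```r
# Bytes of "Sömmerska" as MacRoman would store them (assumption: ö is 0x9A).
raw_bytes <- as.raw(c(0x53, 0x9a, 0x6d, 0x6d, 0x65, 0x72, 0x73, 0x6b, 0x61))
x <- rawToChar(raw_bytes)

# Candidate source encodings; "MACINTOSH" is a common alias for MacRoman.
candidates <- c("CP1252", "latin1", "macroman", "MACINTOSH")

# Hypothetical helper: decode x from `enc` to UTF-8, returning NA when
# this platform's iconv does not know the encoding name.
try_decode <- function(x, enc) {
  tryCatch(iconv(x, from = enc, to = "UTF-8"),
           error = function(e) NA_character_)
}

decoded <- vapply(candidates, function(enc) try_decode(x, enc), character(1))

# Keep the candidates that reproduce the expected spelling.
expected <- "S\u00f6mmerska"
hits <- names(decoded)[!is.na(decoded) & decoded == expected]
print(hits)
```

With real data, the same check can be run on `charToRaw()` of a value read without any conversion, compared against a spelling you know to be correct.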

answered Nov 07 '18 at 08:52 by nJGL · edited Nov 07 '18 at 10:30 by Nick Cox

Note that you make no mention whatsoever in your question that you are using a Mac. Which is probably why nobody answered. – Nov 07 '18 at 11:49
