
I am trying to read several SPSS files into R, all of which contain Cyrillic text. When I read most of them, the console says "re-encoding from CP1251". For some of the files, however, it says "re-encoding from CP1252", which as far as I know is a Western European (Latin) code page, not a Cyrillic one. The CP1251 files read into R with no problem, but the text in the CP1252 files becomes gibberish. I’ve tried the foreign, haven and Hmisc packages for reading the SPSS files, and none of them worked. I've also tried passing reencode='utf-8' to read.spss; when I do this, all of the Cyrillic text becomes NA. The problem occurs in both plain R and RStudio.

library(foreign)

x1 <- read.spss("cp1251_file.sav", to.data.frame = TRUE)  # CP1251 file reads in fine

x2 <- read.spss("cp1252_file.sav", to.data.frame = TRUE)  # file reported as CP1252 becomes gibberish

x2 <- read.spss("cp1252_file.sav", to.data.frame = TRUE, reencode = "utf-8")  # Cyrillic text in the CP1252 file becomes NA

Thanks for your help.

ab27
  • For me it works for German umlauts (üäö) with a combination of the following: `options(encoding = "UTF-8"); spssfile <- as.data.set(spss.system.file('yourfiles.sav')); spssfile <- Iconv(spssfile,from="UTF-8",to="UTF-8")`. Can you check those? – Jan Jul 06 '17 at 03:52
  • this question/answers may also be helpful: https://stackoverflow.com/questions/3136293/read-spss-file-into-r?rq=1 – Jan Jul 06 '17 at 04:00
  • Thank you. I've tried this, and now I get an error when I try to convert to a data frame: `spssfile <- as.data.set(spss.system.file('file.sav', use.value.labels = FALSE)); spssfile <- Iconv(spssfile,from="UTF-8",to="UTF-8"); df <- as.data.frame(spssfile, stringsAsFactors=F)` fails with `Error in as.factor(x) : Duplicate labels`. – ab27 Jul 06 '17 at 04:47
  • Looks like it works if I tell R that the file is CP1251 even though it thinks it is CP1252. Thanks!: `df <- spss.system.file("file.sav"); df <- Iconv(df,from="CP1251",to="UTF-8"); df1 <- as.data.frame(as.data.set(df))` – ab27 Jul 06 '17 at 18:30

1 Answer


It works if I read the file with the memisc package and tell R that the file is CP1251, even though read.spss reports it as CP1252. Thanks!

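library(memisc)

df <- spss.system.file("file.sav")               # open the SPSS file via memisc
df <- Iconv(df, from = "CP1251", to = "UTF-8")   # treat the text as CP1251 and convert to UTF-8
df1 <- as.data.frame(as.data.set(df))            # load the data and convert to a data frame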
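To illustrate why overriding the declared encoding helps, here is a minimal base R sketch that needs no SPSS file. It assumes the underlying bytes really are CP1251 and the file header merely claims CP1252. The sample word and all variable names are made up for illustration, and it assumes the platform's iconv recognises the names "CP1251" and "CP1252".

```r
# Sample Cyrillic word ("Привет"), written with \u escapes so the script
# does not depend on the encoding of the source file itself.
original <- "\u041f\u0440\u0438\u0432\u0435\u0442"

# The raw bytes as they would be stored in a CP1251-encoded file.
cp1251_bytes <- iconv(original, from = "UTF-8", to = "CP1251", toRaw = TRUE)[[1]]
stored <- rawToChar(cp1251_bytes)

# Wrong assumption (CP1252): every byte maps to a Latin character, so the
# conversion "succeeds" but yields gibberish such as "Ïðèâåò".
misread <- iconv(stored, from = "CP1252", to = "UTF-8")

# Wrong assumption (UTF-8): the bytes are not valid UTF-8, so iconv returns NA.
as_utf8 <- iconv(stored, from = "UTF-8", to = "UTF-8")

# Right assumption (CP1251): the original text is recovered.
fixed <- iconv(stored, from = "CP1251", to = "UTF-8")

print(misread)
print(is.na(as_utf8))
print(fixed == original)
```

This matches the symptoms in the question: gibberish under CP1252 and NA under UTF-8. Note also that the reencode argument of foreign::read.spss accepts an encoding name to assume for the file, so read.spss("file.sav", to.data.frame = TRUE, reencode = "CP1251") may be worth trying as well; that variant was not tested on the files discussed here.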

ab27