I am trying to check several hundred variables in my data frame to figure out which of them contain non-ASCII characters, so that I can convert an SPSS dataset into a .dta dataset using R. The data come from SPSS (.sav); I read them in with the foreign package, using read.spss(filename, to.data.frame = TRUE). Now I would like to use write.dta to put my data frame back into Stata format, but I get the following warning:
In abbreviate(ll, 80L) : abbreviate used with non-ASCII chars
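For reference, a minimal version of the workflow that produces this warning (the file names here are placeholders for my own):

library(foreign)

## Read the SPSS (.sav) file into an R data frame
dat <- read.spss("mydata.sav", to.data.frame = TRUE)

## Write the data frame back out in Stata's .dta format -- this is the step
## that triggers the warning about non-ASCII characters
write.dta(dat, "mydata.dta")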
Thanks to Josh O'Brien's answer to the post "Removing non-ASCII characters from data files", I am able to use his code to check one variable at a time for non-ASCII characters:
## Do any lines contain non-ASCII characters?
any(grepl("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII")))
[1] TRUE
and then, for any variable where that check returns TRUE, find the locations of the non-ASCII characters:
## Find which lines (e.g. read in by readLines()) contain non-ASCII characters
grep("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII"))
[1] 1 2 3
Is there a way to use these functions in R to check many variables/character vectors at once and return a list of the variables that contain non-ASCII characters, or can it only be done with a loop? Even more convenient would be a way to simply tell R to convert every non-ASCII character in the data frame into something ASCII-compatible so that I can write it to Stata. So far, following hadley's answer to the same post referenced above, I envision converting each variable individually into an ASCII-compatible character variable, adding it to my dataset, and then dropping the offending original, roughly as sketched below.
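Here is a rough sketch of that loop-based approach, assuming the offending variables were read in as character vectors (factor columns would need as.character() first); df and the "?" substitution character are placeholders of mine:

## Flag every character column that contains non-ASCII characters
is_char <- sapply(df, is.character)
has_non_ascii <- sapply(df[is_char], function(x)
  any(grepl("I_WAS_NOT_ASCII",
            iconv(x, "latin1", "ASCII", sub = "I_WAS_NOT_ASCII"))))
names(has_non_ascii)[has_non_ascii]

## Overwrite each flagged column with an ASCII-substituted copy
for (v in names(has_non_ascii)[has_non_ascii]) {
  df[[v]] <- iconv(df[[v]], "latin1", "ASCII", sub = "?")
}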