
I am trying to check several hundred variables in my data frame to figure out which of them contain non-ASCII characters, so that I can convert an SPSS dataset into a .dta dataset using R. The data set comes from SPSS (.sav); I used the foreign package and read.spss(filename, to.data.frame = TRUE) to read it into R. Now I would like to use write.dta to put my data frame back into Stata, but I get the error:

In abbreviate(ll, 80L) : abbreviate used with non-ASCII chars

Thanks to Josh O'Brien's response to the following post: "Removing non-ASCII characters from data files", I am able to use his code to check one variable at a time for non-ASCII characters.

## Do any lines contain non-ASCII characters? 
any(grepl("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII")))
[1] TRUE

and then check within any variable for which this is TRUE for the location of the non-ASCII characters.

## Find which lines (e.g. read in by readLines()) contain non-ASCII characters
grep("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII"))
[1] 1 2 3

Is there a way to use these functions in R to check multiple variables/character vectors at once and return a list of the variables that contain non-ASCII characters, or can it only be done with a loop? Even more convenient would be a way to tell R to convert all non-ASCII characters in the data frame into something ASCII-compatible so that I can write it to Stata. So far, following hadley's answer to the same post referenced above, I envision converting each variable individually into an ASCII-compatible string variable, adding it to my dataset, and then dropping the offending variable.
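For the first question, a loop is not strictly needed: `sapply()` can run the same check over every column of the data frame at once. A minimal sketch, using a toy data frame in place of the one read in with `read.spss()` (the column names here are made up for illustration):

```r
# Toy data frame standing in for the one read from SPSS
df <- data.frame(id   = 1:2,
                 name = c("Ekstr\u00f8m", "Smith"),
                 city = c("Oslo", "Z\u00fcrich"),
                 stringsAsFactors = FALSE)

# Flag every column that contains non-ASCII characters,
# reusing the iconv() trick from above on each column
has_non_ascii <- sapply(df, function(col) {
  any(grepl("I_WAS_NOT_ASCII",
            iconv(as.character(col), "latin1", "ASCII",
                  sub = "I_WAS_NOT_ASCII")))
})

# Names of the offending variables
names(df)[has_non_ascii]
#> [1] "name" "city"
```

The result is a named logical vector, so `names(df)[has_non_ascii]` lists exactly the variables that need attention.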

solnza

1 Answer


Expanding on code from Hadley's answer:

library('stringi')
library('dplyr')

# simulating an example
x <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher")

df <- data.frame(id = 1:3,
                 logi = c(TRUE, TRUE, FALSE),
                 test = x,
                 test2 = rev(x),
                 test_norm = c('Everything', 'is', 'perfect'),
                 # keep text columns as character, not factors
                 # (needed on R < 4.0, where factors are the default)
                 stringsAsFactors = FALSE)
# added several non-character columns to show that they are not affected
# Now translating every character column to ASCII

df2 <- df %>%
  mutate_if(is.character,
            stri_trans_general,
            id = "latin-ascii")

df2

  id  logi             test            test2  test_norm
1  1  TRUE          Ekstrom bisschen Zurcher Everything
2  2  TRUE         Joreskog         Joreskog         is
3  3 FALSE bisschen Zurcher          Ekstrom    perfect

Of course, this works only for Latin-to-ASCII transliteration.
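As noted in the comments below, `stri_enc_toascii()` is an alternative when transliteration is not enough: it replaces every character that cannot be represented in ASCII with the SUBSTITUTE byte (`\x1a`), so it handles non-Latin scripts too, at the cost of discarding the original letters. A quick comparison of the two:

```r
library(stringi)

x <- c("Ekstr\u00f8m", "J\u00f6reskog")

# Transliterate Latin characters to their closest ASCII counterparts
stri_trans_general(x, id = "latin-ascii")
#> [1] "Ekstrom"  "Joreskog"

# Replace anything non-ASCII with the SUBSTITUTE character (0x1A);
# works for any script, but the original letters are lost
stri_enc_toascii(x)
```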

Andrey Kolyadin
  • This is great and flexible. One can just change the predicate to is.integer or is.factor etc. for other types of variables; strangely, many of my variables are being stored as integers even though they have text/character responses. Also, using "stri_enc_toascii(x)" instead of "stri_trans_general" works for a broader set of cases. One more step: how to then use this "mutated" data to overwrite the old variables? When I View() my data, it still shows the un-transformed data. I'm not sure how to do this, but maybe it deserves a separate question. – solnza Jul 31 '17 at 16:33
  • @solnza Made an edit to assign the new `data.frame` to `df2`. With the pipe operator (`%>%`), the expression `x %>% str()` is equivalent to `str(x)`. It's all explained in `help("%>%")`. `stri_enc_toascii()` works a bit differently from `stri_trans_general()`, so feel free to choose the more appropriate one. – Andrey Kolyadin Aug 01 '17 at 06:58