I have a bunch of author names from foreign countries in a CSV which R reads in just fine. I'm trying to clean them for upload to Mechanical Turk (which really doesn't like even a single internationalized character). In so doing, I have a question (to be posted later), but I can't even dput them in a sensible way:

> dput(df[306,"primauthfirstname"])
"Gwena\xeblle M"
> test <- "Gwena\xeblle M"
<simpleError in nchar(val): invalid multibyte string 1>

In other words, dput works just fine, but pasting the result back in fails. Why doesn't dput output the information needed to copy and paste the result back into R? (Presumably all it needs to do is add the encoding attributes to a structure statement.) How do I get it to do so?

Note that \xeb is a valid character as far as R is concerned:

> gsub("\xeb","", turk.df[306,"primauthfirstname"] )
[1] "Gwenalle M"

But you can't match the characters of the escape individually; it's the whole hex code \x## or nothing:

> gsub("\\x","", turk.df[306,"primauthfirstname"] )
[1] "Gwena\xeblle M"
Ari B. Friedman
  • This works fine for me: `(test <- "Gwena\xeblle M")` yields `[1] "Gwenaëlle M"`. I'm using R 2.14.0 with `LANG=en_US.UTF-8`. – Michael Hoffman Jul 06 '12 at 20:55
  • @MichaelHoffman What's the `LANG` piece? How do I check it? – Ari B. Friedman Jul 06 '12 at 21:02
  • It's an environment variable. Try `Sys.getenv("LANG")`. What version of R are you using? – Michael Hoffman Jul 06 '12 at 21:19
  • "en_US.UTF-8" R2.15.0 linux x64 – Ari B. Friedman Jul 06 '12 at 21:26
  • For me `test <- "Gwena\xeblle M"` yields `[1] "Gwena\xeblle M"` without the OP's error. I had the same gsub() errors as the OP though. I'm on 32-bit Linux, R 2.15.1 with `LANG=en_US.utf8`. – drammock Oct 09 '12 at 00:18
  • Just tested it again on a completely fresh Linux Mint install (64-bit R 2.15.1, same LANG variable) and it returned the same error. – Ari B. Friedman Oct 10 '12 at 02:27
  • For me, `"Gwena\xeblle M"` yields `"Gwenaëlle M"`, the first `gsub` removes the ë, and the third one does nothing, all as expected, I think. I do not have a `LANG` variable set. I am running this on R 2.15.1 on a 64-bit Windows 7 box. Interestingly, my locale variables are all set to `English_United States.1252`. – nograpes Oct 15 '12 at 16:56
  • `test` yields `"Gwenaëlle M"`. I'm on a Slovenian locale with a `LANG` value of `en`. – Roman Luštrik Oct 22 '12 at 09:16

1 Answer

The help page for dput() says: "Writes an ASCII text representation of an R object". So if your object contains non-ASCII characters, these cannot be represented directly and have to be converted somehow.

So I would suggest using iconv() to convert your vector before calling dput(). One approach is:

> test <- "Gwena\xeblle M"
> out <- iconv(test, from="latin1", to="ASCII", sub="byte")
> out
[1] "Gwena<eb>lle M"
> gsub('<eb>', 'ë', out)
[1] "Gwenaëlle M"

which, as you see, works both ways: the `<eb>` marker is plain ASCII, so it survives dput() and copy/pasting, and you can later use gsub() to convert the markers back into characters (if your encoding supports them, e.g. UTF-8).
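
If there are many different accented characters, writing one gsub() call per byte gets tedious. Here is a minimal sketch of a generic back-conversion, assuming the markers came from `iconv(..., from="latin1", sub="byte")`, so that each `<xx>` stands for one Latin-1 byte. The helper name `restore_bytes` is made up for illustration and is not part of R:

```r
# Hypothetical helper (not part of R): turn "<xx>" byte markers back into
# the Latin-1 characters they stand for, returning UTF-8 strings.
restore_bytes <- function(s) {
  # Locate every "<xx>" marker (two lowercase hex digits) in each string
  m <- gregexpr("<[0-9a-f]{2}>", s)
  regmatches(s, m) <- lapply(regmatches(s, m), function(tokens) {
    vapply(tokens, function(tok) {
      # "<eb>" -> "eb" -> 235 -> single raw byte -> one-byte string
      ch <- rawToChar(as.raw(strtoi(substr(tok, 2, 3), base = 16L)))
      Encoding(ch) <- "latin1"  # declare the byte as Latin-1
      enc2utf8(ch)              # re-encode as UTF-8
    }, character(1), USE.NAMES = FALSE)
  })
  s
}

restored <- restore_bytes("Gwena<eb>lle M")
```

One caveat with this sketch: a name that literally contains text such as `<eb>` would be altered too, so it is only safe on strings you produced with `sub="byte"` yourself.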

The second approach is simpler (and I guess preferable for your needs), but works one-way and your libiconv may not support it:

> test <- "Gwena\xeblle M"
> iconv(test, from="latin1", to="ASCII//TRANSLIT")
[1] "Gwenaelle M"
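
Since transliteration depends on the iconv implementation R was built against, it may be worth checking the result before relying on it. The following is only a sketch under that assumption: it treats an error, an `NA`, or a `?` placeholder as "transliteration not available" and falls back to the byte markers from the first approach. The variable names are illustrative:

```r
x <- "Gwena\xeblle M"  # Latin-1 bytes, written with a hex escape

# Try transliteration first; treat an unsupported conversion as NA
ascii <- tryCatch(
  iconv(x, from = "latin1", to = "ASCII//TRANSLIT"),
  error = function(e) NA_character_
)

# Some iconv builds return NA or substitute "?" instead of transliterating;
# in that case fall back to the "<xx>" byte markers
if (is.na(ascii) || grepl("?", ascii, fixed = TRUE)) {
  ascii <- iconv(x, from = "latin1", to = "ASCII", sub = "byte")
}
```

Either way `ascii` ends up containing only ASCII characters, which is what dput() and Mechanical Turk need.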

Hope this helps!

Theodore Lytras