
I'm trying to read some JSON from an API page of a Twitter firehose. In the tweets I download there are many non-English characters, e.g.:

"text":"Vaccini: perch\u00e9 fare l\u2019esavalente quando gli obbligatori sono solo quattro? http://t.co/dLdzoXOUUK via @wireditalia"

When I import the tweet data via readLines in R and print it, I see:

\\"text\\":\\"Vaccini: perch\\u00e9 fare l\\u2019esavalente quando gli obbligatori sono solo quattro? http://t.co/dLdzoXOUUK via @wireditalia\\"

So, both backslashes and quotes are escaped. If I print using cat() instead, the escaping is gone, so at first I thought the problem was print(). But when I parsed the data with fromJSON I saw that strings like \u00e9 become \xe9. I tried to understand why, and with some tests I noticed that

fromJSON('["\\u00e9"]') 

prints

"\xe9"

and

fromJSON('["\\u2019"]') 

prints

"\031"

instead of "é" and "’" respectively, as they should be. So jsonlite::fromJSON misinterprets those double backslashes.

But the problem is the double backslashes themselves! Why does R escape everything in the first place? I can't even run gsub('\u', '\u', text, fixed=T); it returns:

Error: '\u' used without hex digits in character string starting "'\u"

because it sees \u as a special char and doesn't allow it to be used in a replacement!
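
To illustrate (a small sketch, where text stands in for one downloaded line): the error comes from the R parser before gsub() even runs, and doubling the backslash only gives me back a literal backslash-u:

text <- 'perch\\u00e9'                    # what readLines gives me: a literal \u00e9
# gsub('\u', '\u', text, fixed=T)         # parse error: '\u' used without hex digits
gsub('\\u', '\\u', text, fixed = TRUE)    # parses fine, but just swaps \u for \u
# [1] "perch\\u00e9"                      # still the literal escape, not "é"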

Moreover, this default escaping by R also makes my script fail when it encounters a user who set this as their location:

"location":"V\u03b1l\u03c1\u03bfl\u03b9c\u03b5ll\u03b1-V\u03b5r\u03bfn\u03b1 /=\\","default_profile_image":false

which in her twitter profile is:

Vαlροlιcεllα-Vεrοnα /=\

That \" in the source code is displayed on R as /=\', therefore breaking the json.

So, I need a way to escape these escaping problems!

Bakaburg
  • It's unclear exactly what's happening. Can you post a [minimal, reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? – MrFlick Jul 20 '14 at 19:24
  • now it should be better – Bakaburg Jul 21 '14 at 16:48
  • What library is the `fromJSON` function from? Are you sure R is doing the improper escaping? How are you getting the data into R? – MrFlick Jul 21 '14 at 17:00
  • it's all written in the text :) the library is jsonlite; the escapes aren't in the source code of the page in the browser, and I'm using readLines(). – Bakaburg Jul 21 '14 at 17:30
  • So whatever you are using to write them to disk is not writing them correctly. You should not see "\u2019" in the text file on disk. That means the data was not properly decoded. R escapes that value because that's exactly how it's represented in the file. I think you want a replace like `gsub('\\u00e9', '\u00e9', text)`. But this really isn't R's fault. – MrFlick Jul 21 '14 at 17:51
  • The special chars are written as unicode codes in the json source itself. The problem is that I don't know how to make R understand that it should not escape those unicode chars but convert them instead. I can't really replace manually every unicode char that I run into!! I tried gsub('\u00e9', '\u', text, fixed=T) but it gave me an error, "'\u' used without hex digits in character string starting "'\u"", because \u is interpreted as a special char itself and cannot be used in substitutions. – Bakaburg Jul 21 '14 at 17:58

1 Answer


The problem is in your input data. The text you read into R should not contain \u sequences as plain text; that is simply incorrect data. When R displays a value with \u, that is an escape sequence for a Unicode character: there aren't actually any backslashes or "u"s in the text.
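
A quick sketch to convince yourself of this:

nchar("\u00e9")   # 1 -- a single character, é; the \u is only source-code notation
nchar("\\u00e9")  # 6 -- a real backslash, a "u", and four hex digits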

But if you have bad data that you need to read into R, you can find all the \u sequences followed by four hexadecimal digits and replace them with proper Unicode characters. For example, say you have this string in tt:

tt <- "\\u00e9 and \\u2019 and \\u25a0"

If you cat() the value in R to remove the escaping, you will see that it contains

cat(tt)
#\u00e9 and \u2019 and \u25a0

So there are literal "\u" sequences in the text (they are not true Unicode characters). We can find and replace them with

# locate every literal \u followed by four hex digits
m <- gregexpr("\\\\u[0-9A-Fa-f]{4}", tt)
# drop the leading "\u" and convert each hex code point to its character
regmatches(tt, m) <- lapply(
    lapply(regmatches(tt, m), substr, 3, 999),
    function(x) intToUtf8(as.integer(as.hexmode(x)), multiple = TRUE))
tt
# [1] "é and ’ and ■"

This will find all the "\u" sequences and replace them.
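
If you need to do this repeatedly, you could wrap the same technique in a small helper (the function name here is mine, not from any package):

unescape_unicode <- function(x) {
    # locate every literal \u followed by four hex digits
    m <- gregexpr("\\\\u[0-9A-Fa-f]{4}", x)
    # drop the leading "\u" and map each hex code point to its character
    regmatches(x, m) <- lapply(
        lapply(regmatches(x, m), substr, 3, 999),
        function(hex) intToUtf8(as.integer(as.hexmode(hex)), multiple = TRUE))
    x
}

# e.g. the location string from the question
unescape_unicode("V\\u03b1l\\u03c1\\u03bfl\\u03b9c\\u03b5ll\\u03b1-V\\u03b5r\\u03bfn\\u03b1")
# [1] "Vαlροlιcεllα-Vεrοnα"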

It's just important to note that

fromJSON('["\\u2019"]')

is not a Unicode character. By doubling the backslash, you've escaped the escape character, so you literally have a backslash followed by "u". To get a true Unicode character you need

fromJSON('["\u2019"]') 
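
which, with jsonlite loaded, returns the actual character (on a UTF-8 locale):

library(jsonlite)
fromJSON('["\u2019"]')
# [1] "’"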

If your data were properly encoded before being loaded into R, this wouldn't be a problem. I don't understand what you are using to download the tweets, but clearly it is messing things up.

MrFlick
  • Thanks a lot! But couldn't I just force the encoding after I've downloaded the data? – Bakaburg Jul 22 '14 at 15:31
  • No. That file has been written to disk incorrectly. It is no longer an encoding issue. If there is a "\u" in the file, then there will be a "\u" in the file no matter the encoding. Both "\" and "u" are characters in the ASCII subset; they are no longer an escape sequence. Encoding just tells the computer how to interpret the bytes in the file. Whoever wrote "\u" to the file has already changed those bytes in a non-standard way. – MrFlick Jul 22 '14 at 16:06
  • Thanks. One question: is the variable v in `lapply(v, substr, 3, 999)` the result of regmatches(tt,m)? Then `sapply(v, substr, 3, 999)` should be used instead! – Bakaburg Jul 22 '14 at 16:18
  • Yes. Sorry about that. I've updated the code. It should still result in a list so `lapply` is probably the more appropriate choice (more predictable) – MrFlick Jul 22 '14 at 16:33
  • Yep, but lapply returns [[1]] [1] "00e9" "2019", for example, and the subsequent lapply is applied to the upper-level element. (Or so I thought; anyway, with lapply it doesn't work :D) – Bakaburg Jul 22 '14 at 16:44