I'm trying to read some JSON from an API page of a Twitter firehose. The tweets I download contain many non-English characters, e.g.:
"text":"Vaccini: perch\u00e9 fare l\u2019esavalente quando gli obbligatori sono solo quattro? http://t.co/dLdzoXOUUK via @wireditalia"
When I import the tweet data via readLines in R and print it, I see:
\\"text\\":\\"Vaccini: perch\\u00e9 fare l\\u2019esavalente quando gli obbligatori sono solo quattro? http://t.co/dLdzoXOUUK via @wireditalia\\"
So both the backslashes and the quotes are escaped. If I print with cat() instead, the escaping is gone, so I thought the problem was with print(). But when I parse the data with fromJSON, strings like \u00e9 become \xe9. I tried to understand why, and after some tests I noticed that
fromJSON('["\\u00e9"]')
prints
"\xe9"
and
fromJSON('["\\u2019"]')
prints
"\031"
instead of "é" and "’" respectively, as they should. So jsonlite::fromJSON seems to misinterpret those double backslashes.
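A minimal base-R check of the print() versus cat() difference described above. It only shows how to tell whether a backslash is really stored twice or is merely displayed twice by print(); the sample string `raw` is a hypothetical stand-in for one tweet fragment, not the actual firehose data.

```r
# Hypothetical sample: the six characters backslash, u, 0, 0, e, 9 after "perch".
# The R literal needs "\\" to spell a single backslash.
raw <- "perch\\u00e9"

print(raw)       # print() shows the escaped, re-readable form of the string
cat(raw, "\n")   # cat() writes the characters exactly as they are stored

# Count the stored characters: 5 letters + backslash + "u00e9" = 11.
n_chars <- nchar(raw)

# Search for two consecutive backslashes as plain text (the literal "\\\\"
# spells two backslashes); none are stored in this sample.
has_double <- grepl("\\\\", raw, fixed = TRUE)
```

If `has_double` is FALSE for a real line from readLines, the doubling seen with print() is only a display convention and the data itself is unchanged.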
But the problem is the double backslashes themselves! Why does R escape everything in the first place? I cannot even run gsub('\u', '\u', text, fixed=T), because it returns:
Error: '\u' used without hex digits in character string starting "'\u"
because it treats \u as a special character and doesn't allow it to be used as a replacement!
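For reference, a small sketch of how a literal backslash followed by u can be written in R source without triggering that parser error. The input `text` and the replacement "U+" are hypothetical and chosen only for illustration; this is not meant as the fix for the JSON parsing itself.

```r
# In R source, "\\u" is a two-character string: one backslash followed by u.
# A bare "\u" is rejected by the parser because it expects hex digits after it.
pattern <- "\\u"
pattern_length <- nchar(pattern)   # 2

# Hypothetical input holding a literal backslash-u sequence.
text <- "perch\\u00e9"

# fixed = TRUE makes gsub() treat the pattern as plain text, not as a regex,
# so the backslash needs no further escaping beyond the string literal itself.
swapped <- gsub(pattern, "U+", text, fixed = TRUE)
cat(swapped, "\n")
```

The error in the call above comes from R's parser reading the string literal, before gsub() is ever run.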
Moreover, this default escaping by R also makes my script fail when it encounters a user who set this as her location:
"location":"V\u03b1l\u03c1\u03bfl\u03b9c\u03b5ll\u03b1-V\u03b5r\u03bfn\u03b1 /=\\","default_profile_image":false
which on her Twitter profile is:
Vαlροlιcεllα-Vεrοnα /=\
That \" in the source is displayed in R as /=\', which breaks the JSON.
So, I need a way to escape these escaping problems!