
I have a problem parsing a JSON file that contains Russian (Cyrillic) text in R. The file looks like this:

[{"text": "Валера!", "type": "status"}, {"text": "когда выйдет", "type": "status"}, {"text": "КАК ДЕЛА?!)", "type": "status"}]

and it is saved in UTF-8 encoding. I tried the rjson, RJSONIO and jsonlite packages to parse it, but none of them works:

library(jsonlite)
allFiles <- fromJSON(txt="ru_json_example_short.txt")

gives me the error

Error in feed_push_parser(buf) : 
  lexical error: invalid char in json text.
                                       [{"text": "Валера!", "
                     (right here) ------^

When I save the file in ANSI encoding, parsing works, but the Russian letters are turned into question marks, so the output is unusable. Does anyone know how to parse such a JSON file in R, please?

Edit: The above applies to a UTF-8 file saved in Windows Notepad. When I save it in PSPad and then parse it, the result looks like this:

                                                                                       text   type
1                                         <U+0412><U+0430><U+043B><U+0435><U+0440><U+0430>! status
2 <U+043A><U+043E><U+0433><U+0434><U+0430> <U+0432><U+044B><U+0439><U+0434><U+0435><U+0442> status
3                              <U+041A><U+0410><U+041A> <U+0414><U+0415><U+041B><U+0410>?!) status
Pavel Sůva
  • Are you on windows? Are you 100% sure the file is saved as UTF-8? – MrFlick May 11 '15 at 16:16
  • Yes, I am on Windows. I saved the file as UTF-8 in Notepad. – Pavel Sůva May 11 '15 at 16:19
  • In R, I tried `x<-'[{"text": "Валера!", "type": "status"}, {"text": "когда выйдет", "type": "status"}, {"text": "КАК ДЕЛА?!)", "type": "status"}]'` and verified that `Encoding(x)=="UTF-8"`. Then I wrote it out with `writeLines(x, "test.txt", useBytes=TRUE)` and read it in with `fromJSON(txt="test.txt")` without a problem. – MrFlick May 11 '15 at 16:28
  • Yes, thanks, that works for me, but the result is the one shown in the edited question: instead of the Russian letters, there are only their codes, which is also a problem, as I want to work with the text and see what's written there... – Pavel Sůva May 11 '15 at 16:31
  • If you ran `x<-fromJSON(...)`, then what does `x$text` look like and what does `Encoding(x$text)` return? – MrFlick May 11 '15 at 16:32
  • It looks fine: `y$text` gives `"Валера!" "когда выйдет" "КАК ДЕЛА?!)"` and `Encoding(y$text)` gives `"UTF-8" "UTF-8" "UTF-8"`. Thank you very much! So when I want to see the Russian text, I have to look at the specific column (i.e. `$text` in this example), right? – Pavel Sůva May 11 '15 at 16:40
  • So it only looks funny when you do just `print(y)`? What version of R are you running? – MrFlick May 11 '15 at 16:42
  • R x64 3.1.2 with RStudio 0.98.1091 – Pavel Sůva May 11 '15 at 16:43
  • ...updating to the newest versions didn't help either... – Pavel Sůva May 12 '15 at 07:56
  • I actually don't use Windows myself. It works fine on my Mac. I did some quick tests and it looks like, while `y$text` prints fine, `format(y$text)` causes a conversion to the "native locale" encoding on Windows, and that is basically what gets called when you print the data.frame object. I'm not sure how to force that to UTF-8 encoding. – MrFlick May 12 '15 at 14:20
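The check discussed in the comments above can be reproduced with a minimal sketch like the following. It assumes jsonlite is installed; the variable names `json` and `y` are only illustrative, and the Cyrillic text is written with `\u` escapes so that the script itself stays ASCII. It inspects the `text` column directly instead of printing the whole data frame.

```r
library(jsonlite)

# One record from the question, written with \u escapes ("Валера!")
json <- '[{"text": "\u0412\u0430\u043b\u0435\u0440\u0430!", "type": "status"}]'

y <- fromJSON(json)

# Look at the character column itself rather than the data frame
Encoding(y$text)          # encoding mark of each element
print(y$text)             # prints the character vector directly
cat(y$text, sep = "\n")   # writes the strings without quoting
```

Whether `print(y)` shows `<U+....>` codes depends on the platform and locale, as noted in the comments; the column contents are the same either way.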

1 Answer


Try the following:

library(jsonlite)
# Join the lines of the file with commas and wrap them in [ ] before parsing
dat <- fromJSON(sprintf("[%s]",
                paste(readLines("./ru_json_example_short.txt"),
                collapse=",")))
dat
[[1]]
          text   type
1      Валера! status
2 когда выйдет status
3  КАК ДЕЛА?!) status

ref: Error parsing JSON file with the jsonlite package
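One possible cause of the "invalid char" error at the very first character is a UTF-8 byte-order mark (BOM), which Windows Notepad can write at the start of UTF-8 files. This is an assumption; the thread does not confirm it. The sketch below builds a temporary test file with a BOM, strips the three BOM bytes if they are present, and then parses the rest. The names `path`, `bom`, `bytes` and `txt` are illustrative only.

```r
library(jsonlite)

# Build a test file that starts with a UTF-8 BOM (bytes EF BB BF).
# The text value is "Валера!" written with \u escapes.
path <- tempfile(fileext = ".txt")
json <- '[{"text": "\u0412\u0430\u043b\u0435\u0440\u0430!", "type": "status"}]'
bom  <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw(enc2utf8(json))), path)

# Read the file as raw bytes so no locale conversion takes place
bytes <- readBin(path, what = "raw", n = file.info(path)$size)

# Drop the BOM if the file starts with one (assumed cause of the error)
if (length(bytes) >= 3 && identical(bytes[1:3], bom)) {
  bytes <- bytes[-(1:3)]
}

# Turn the bytes back into a string and mark it as UTF-8
txt <- rawToChar(bytes)
Encoding(txt) <- "UTF-8"

dat <- fromJSON(txt)
```

For a real file, `path` would point to it instead of a temporary file. Reading through a connection opened with `file(path, encoding = "UTF-8-BOM")` may be a shorter alternative, though it re-encodes the text to the native locale on read.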

Community
Technophobe01