The output below was generated when I saved an R data frame in JSON format. My data frame contains a mix of HTML links and accented characters. I have to work with this file in a PHP/HTML environment.

library(jsonlite)
output_json <- toJSON(output, dataframe = "rows", pretty = T)
write(output_json, file = "output.txt")

  {
  "PMID":"<a href= \"http://www.ncbi.nlm.nih.gov/pubmed/?term=19369233\"
           target=\"_blank\">19369233</a>",
  "Title":"Delayed achievement of cytogenetic and molecular response is
          associated with increased risk of progression among patients with
          chronic myeloid leukemia in early chronic phase receiving
          high-dose or standard-dose imatinib therapy.",
  "Author":"Quintás-Cardama A",
  "Random  author names":"Järås M", "Imrédi E", "Tímár J."      
},

When I open the output.txt file or print the output on an HTML page, the accented letters in the author names are replaced by a placeholder character, e.g. Imr�di E.

When I read the JSON file with the PHP code below, json_decode() fails and returns NULL. After researching on SO, I am fairly certain that the issue comes from the accented characters and, in some cases, from improperly escaped newlines (\r\n) or HTML tags.

$r_output = file_get_contents('output.txt');
$array_json = json_decode($r_output, true);

I tried the suggestions from related questions, e.g. "How do I handle newlines in JSON?" and "PHP json_decode() returns NULL with valid JSON?", but could not solve the issue.

Hence I am tagging PHP and R users: is there a better way to write the JSON from R that avoids this issue, or a way to clean the JSON before reading it in PHP?

Thank you for your help.

user5249203
  • The issue is probably caused by a charset mismatch. I'd wager that the input data is ISO-8859 and JSON is inherently assumed to be UTF-8. – Sammitch May 02 '16 at 20:47
  • Sounds like it worked fine; you're just not matching the page encoding. –  May 02 '16 at 20:47
  • @Sammitch, you are probably right. @miken32, I thought `toJSON` encodes in UTF-8 by default. Thank you for pointing out that the encoding has to be set when the write function is called, and for the link. I hope this is not a duplicate. I should have tried to solve the encoding issue when writing the file, rather than in the toJSON() conversion. Thanks for the direction. – user5249203 May 02 '16 at 21:20

2 Answers

Try applying utf8_encode to $r_output and removing the line breaks, i.e.:

$r_output = utf8_encode(file_get_contents('output.txt'));
$r_output = preg_replace("/[\n\r]/","",$r_output);
$array_json = json_decode($r_output, true);

Alternatively, try utf8_decode:

$r_output = utf8_decode(file_get_contents('output.txt'));
$r_output = preg_replace("/[\n\r]/","",$r_output);
$array_json = json_decode($r_output, true);

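PS: your JSON seems invalid: the "Random  author names" key is followed by several comma-separated strings ("Järås M", "Imrédi E", "Tímár J.") instead of a single value or an array.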
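
Note that utf8_encode() and utf8_decode() are deprecated as of PHP 8.2, and utf8_encode() converts unconditionally, so it would double-encode a file that is already UTF-8. Below is a minimal sketch of a more defensive variant. It assumes the input is either UTF-8 or ISO-8859-1 and that the mbstring extension is available; decode_json_any_charset is a hypothetical helper name, and the demo string stands in for the contents of output.txt:

```php
<?php
// Hypothetical helper (not part of the question): decode JSON text whose
// charset is assumed to be either UTF-8 or ISO-8859-1.
function decode_json_any_charset(string $raw)
{
    // Convert only when the bytes are not already valid UTF-8,
    // so text that is already UTF-8 is never double-encoded.
    if (!mb_check_encoding($raw, 'UTF-8')) {
        $raw = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');
    }

    $data = json_decode($raw, true);
    if (json_last_error() !== JSON_ERROR_NONE) {
        // Report why decoding failed instead of silently returning NULL.
        throw new RuntimeException('JSON decode failed: ' . json_last_error_msg());
    }
    return $data;
}

// Demo input: "Imrédi E" with é stored as the single ISO-8859-1 byte 0xE9.
$latin1 = "{\"Author\":\"Imr\xE9di E\"}";

$decoded = decode_json_any_charset($latin1);
echo $decoded['Author'], PHP_EOL;
```

With a real file, the argument would be file_get_contents('output.txt'). If decoding still fails, the exception message from json_last_error_msg() shows whether the remaining problem is the encoding, a control character such as a raw newline inside a string, or a syntax error.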

Pedro Lobito

Write the output file as UTF-8 to begin with:

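library(jsonlite)
output_json <- toJSON(output, dataframe = "rows", pretty = TRUE)
# Write through a connection that encodes the text as UTF-8
con <- file("output.txt", encoding = "UTF-8")
write(output_json, file = con)
close(con)  # release the connection once the file is written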
miken32