I have deployed my code in Azure Machine Learning and run the batch request from R on different operating systems, such as Unix and Windows 10. For some reason, the output is properly formatted only when I run R on Windows 10; on Unix systems I cannot get properly formatted output. The only way to get properly formatted output on every system is to download the file manually through the Azure GUI. On Windows 10, I can get the properly formatted file directly from my R script in Rscript or RStudio. In R, I have used system("defaults write org.R-project.R force.LANG en_US.UTF-8"), as hinted here, to specify the encoding explicitly, but this has no effect on the batch request R script that is executed on the Azure servers run by Microsoft.

What happens is that the bytes of UTF-8 characters are interpreted as Latin-1 characters, for example

  1. ö as Ã¶

  2. ä as Ã¤

  3. å as Ã¥

as can be demonstrated and tested with this tool here about Latin-1 characters. So what are the best ways to deal with this encoding issue? Can it be addressed somehow inside Azure ML? Where can bugs be reported? Is there a tool in R to convert Latin-1 to UTF-8?

How can you get properly formatted UTF-8 files with umlauts, rather than Latin-1 characters, from R batch requests in Azure ML?

hhh
    There are Windoze specific character sets: https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx. Seems difficult to consider this a bug unless you can show how one or more of your programs are failing to behave as documented. User error or confusion is not a "bug". The system call you posted looks more like an OSX system command? – IRTFM Jan 27 '17 at 21:50

1 Answer

The batch request R code has a saveBlobToFile function. The problem is that saveBlobToFile calls getURL without specifying an encoding, and getURL needs the encoding stated explicitly. Make the following change:

blobContent = getURL(blobUrl, .encoding="UTF-8")

Without .encoding, the output is treated as ISO-8859-1 ('latin1') or as whatever encoding is inherited from your system.
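If a file has already been saved with the wrong interpretation, the damage can usually be undone in base R. The following is only a minimal sketch, assuming the text was decoded as plain Latin-1 (not Windows-1252) and contains nothing outside that range; fix_mojibake is a hypothetical helper name, not part of RCurl or the Azure ML batch script.

```r
# Build "ö" from its UTF-8 bytes (0xC3 0xB6) so the example does not
# depend on the encoding of this script file.
utf8_bytes <- as.raw(c(0xC3, 0xB6))

# Reproduce the problem: the same two bytes declared as Latin-1
# are read as two separate characters.
wrong <- rawToChar(utf8_bytes)
Encoding(wrong) <- "latin1"
mangled <- enc2utf8(wrong)  # the two-character string stored as UTF-8

# Hypothetical helper: convert the mangled text back to its Latin-1
# bytes, then declare those bytes to be what they really are, UTF-8.
fix_mojibake <- function(x) {
  bytes_as_latin1 <- iconv(x, from = "UTF-8", to = "latin1")
  Encoding(bytes_as_latin1) <- "UTF-8"
  bytes_as_latin1
}

repaired <- fix_mojibake(mangled)
print(charToRaw(mangled))   # four bytes: the double-encoded form
print(charToRaw(repaired))  # two bytes: the original UTF-8 form
```

iconv returns NA for input it cannot represent in Latin-1, so the result is worth checking before overwriting any file. Passing .encoding to getURL remains the cleaner fix, since it avoids the wrong decoding in the first place.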

hhh