I am trying to download data from a website whose content is in both English and a local (non-English) language. The English content came through fine, but the local-language content is shown as escaped code points, as in the output below. How do I display both correctly?

X1  X2  X3
NA      
1   <U+0926><U+094B><U+0932><U+0916><U+093E>    <U+0915><U+093E><U+0932><U+093F><U+0928><U+094D><U+091A><U+094B><U+0915>  <U+0917><U+093E><U+0909><U+0901><U+092A><U+093E><U+0932><U+093F><U+0915><U+093E>
2   <U+0926><U+094B><U+0932><U+0916><U+093E>    <U+0915><U+093E><U+0932><U+093F><U+0928><U+094D><U+091A><U+094B><U+0915>  <U+0917><U+093E><U+0909><U+0901><U+092A><U+093E><U+0932><U+093F><U+0915><U+093E>
3   <U+0926><U+094B><U+0932><U+0916><U+093E>    <U+0915><U+093E><U+0932><U+093F><U+0928><U+094D><U+091A><U+094B><U+0915>  <U+0917><U+093E><U+0909><U+0901><U+092A><U+093E><U+0932><U+093F><U+0915><U+093E>
4   <U+0926><U+094B><U+0932><U+0916><U+093E>    <U+0915><U+093E><U+0932><U+093F><U+0928><U+094D><U+091A><U+094B><U+0915>  <U+0917><U+093E><U+0909><U+0901><U+092A><U+093E><U+0932><U+093F><U+0915><U+093E>
5   <U+0926><U+094B><U+0932><U+0916><U+093E>    <U+0915><U+093E><U+0932><U+093F><U+0928><U+094D><U+091A><U+094B><U+0915>  <U+0917><U+093E><U+0909><U+0901><U+092A><U+093E><U+0932><U+093F><U+0915><U+093E>
6   <U+0926><U+094B><U+0932><U+0916><U+093E>    <U+0915><U+093E><U+0932><U+093F><U+0928><U+094D><U+091A><U+094B><U+0915>  <U+0917><U+093E><U+0909><U+0901><U+092A><U+093E><U+0932><U+093F><U+0915><U+093E>
7   <U+0926><U+094B><U+0932><U+0916><U+093E>    <U+0915><U+093E><U+0932><U+093F><U+0928><U+094D><U+091A><U+094B><U+0915>  <U+0917><U+093E><U+0909><U+0901><U+092A><U+093E><U+0932><U+093F><U+0915><U+093E>
8   <U+0926><U+094B><U+0932><U+0916><U+093E>    <U+0915><U+093E><U+0932><U+093F><U+0928><U+094D><U+091A><U+094B><U+0915>  <U+0917><U+093E><U+0909><U+0901><U+092A><U+093E><U+0932><U+093F><U+0915><U+093E>
9   <U+0926><U+094B><U+0932><U+0916><U+093E>    <U+0915><U+093E><U+0932><U+093F><U+0928><U+094D><U+091A><U+094B><U+0915>  <U+0917><U+093E><U+0909><U+0901><U+092A><U+093E><U+0932><U+093F><U+0915><U+093E>
10  <U+0926><U+094B><U+0932><U+0916><U+093E>    <U+0915><U+093E><U+0932><U+093F><U+0928><U+094D><U+091A><U+094B><U+0915>  <U+0917><U+093E><U+0909><U+0901><U+092A><U+093E><U+0932><U+093F><U+0915><U+093E>
user227710

1 Answer

You likely have the text that you want. It is simply being displayed incorrectly.

I can reproduce your problem. Your example data had the same strings 10 times. To keep the display reasonable, I repeat them only 3 times.

## Hex codes from your example
S1 = c("0926", "094B", "0932", "0916", "093E") 
S2 = c("0915", "093E", "0932", "093F", "0928", "094D", "091A", "094B", "0915")  
S3 = c("0917", "093E", "0909", "0901", "092A", "093E", "0932", "093F", "0915", "093E")

## Convert to Devanagari strings
X1 = rep(intToUtf8(strtoi(S1, base=16L)), 3)
X2 = rep(intToUtf8(strtoi(S2, base=16L)), 3)
X3 = rep(intToUtf8(strtoi(S3, base=16L)), 3)

df = data.frame(X1, X2, X3, stringsAsFactors=FALSE)
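As a quick check that the data frame really holds the characters and not the escape text, the stored string can be converted back to code points. This is only a sketch: `s1_hex`, `df_check`, and `code_points` are names made up for illustration, and the hex values are the ones from S1 above.

```r
# Rebuild one string from its hex code points (same values as S1 above).
s1_hex <- c("0926", "094B", "0932", "0916", "093E")
x <- intToUtf8(strtoi(s1_hex, base = 16L))
df_check <- data.frame(X1 = rep(x, 3), stringsAsFactors = FALSE)

# Round trip: pull the string back out of the data frame and
# list its code points as 4-digit uppercase hex.
code_points <- sprintf("%04X", utf8ToInt(df_check$X1[1]))
print(code_points)

# Inspect the declared encoding and the number of characters stored.
print(Encoding(df_check$X1[1]))
print(nchar(df_check$X1[1], type = "chars"))
```

If the printed code points match the hex values that went in, the data is intact and only the printing of the data frame is affected.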

Now X1 will display correctly, but df will not.

Bizarrely, df$X1 and df[,1] will display the Unicode characters, but df[1, ] will not.

A workaround is to print as.matrix(df), which displays the whole thing as Unicode characters.

This is apparently a known bug in the Windows version of the RGui. Other explorations of it can be found in an earlier SO question and in a mailing list post.
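The as.matrix workaround mentioned above might look like the sketch below. The names `x1`, `df_demo`, and `mat` and the extra `id` column are assumptions added for illustration; whether the console shows the characters still depends on the console and font in use.

```r
# One Devanagari string built from its code points (same values as S1).
x1 <- intToUtf8(c(0x0926L, 0x094BL, 0x0932L, 0x0916L, 0x093EL))

# A small data frame mixing text and numbers.
df_demo <- data.frame(X1 = rep(x1, 3), id = 1:3, stringsAsFactors = FALSE)

# Converting to a matrix turns every column into character strings
# and bypasses the data frame print method.
mat <- as.matrix(df_demo)
print(mat, quote = FALSE)
```

Note that the conversion makes numeric columns character as well, so this is best used for display only, not for further computation.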

Addendum

Writing these strings to a readable Unicode file requires some care. The code below created a CSV file for my example.

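Mat = as.matrix(df)
## Open in binary mode; "con" avoids masking F, the shorthand for FALSE
con <- file("Test1.csv", "wb", encoding="UTF-8")
## UTF-8 byte order mark, written at the very start of the file
BOM <- charToRaw('\xEF\xBB\xBF')
writeBin(BOM, con)
for(r in seq_len(nrow(Mat))) {
    Line = paste(Mat[r,], collapse=",")
    ## useBytes=TRUE writes the bytes of the string without re-encoding
    writeLines(Line, con, useBytes=TRUE)
}
close(con)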
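To confirm that such a file round-trips, it can be read back and compared with the original string. This is a minimal sketch under some assumptions: it writes to a temporary file instead of Test1.csv, and the names `path`, `first_bytes`, `lines_in`, and `fields` are made up for illustration.

```r
# Hypothetical temporary file; the example above uses "Test1.csv".
path <- tempfile(fileext = ".csv")

# Write a BOM followed by one comma-separated line, as in the code above.
x <- intToUtf8(c(0x0926L, 0x094BL, 0x0932L, 0x0916L, 0x093EL))
con <- file(path, "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)   # UTF-8 byte order mark
writeLines(paste(x, x, sep = ","), con, useBytes = TRUE)
close(con)

# The first three bytes of the file should be the BOM.
first_bytes <- readBin(path, what = "raw", n = 3L)
print(first_bytes)

# Read the text back, declaring it as UTF-8.
lines_in <- readLines(path, encoding = "UTF-8")

# Drop a leading BOM character (U+FEFF) if one is present.
cp <- utf8ToInt(lines_in[1])
if (cp[1] == 0xFEFF) cp <- cp[-1]
fields <- strsplit(intToUtf8(cp), ",", fixed = TRUE)[[1]]

# TRUE if the first field has the same code points as the original string.
print(identical(utf8ToInt(fields[1]), utf8ToInt(x)))
```

Comparing code points instead of printed text keeps the check independent of how the console renders the characters.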
G5W
  • Thank you very much for the answer. Any idea how to save the matrix so that I can view characters not only in console but also in the saved file (e.g., csv)? – user227710 Jun 03 '17 at 06:14
  • Too complicated for comment. Adding to answer. – G5W Jun 03 '17 at 11:39