I want to generate UTF-8 HTML output from a data frame using kable, but non-ASCII characters come out as escaped code points such as <U+0628>. I know there are many similar questions on Stack Overflow, but I still can't find a solution to this problem.

kable("ب",format="html")

generates:

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> x </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> &lt;U+0628&gt; </td>
  </tr>
</tbody>
</table>

R is running on Windows with the following session info:

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252   
[3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C                   
[5] LC_TIME=English_Canada.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] knitr_1.38

loaded via a namespace (and not attached):
[1] compiler_4.0.3 tools_4.0.3    highr_0.8      xfun_0.30

and the system locale:

Sys.getlocale()
[1] "LC_COLLATE=English_Canada.1252;LC_CTYPE=English_Canada.1252;LC_MONETARY=English_Canada.1252;LC_NUMERIC=C;LC_TIME=English_Canada.1252"

I've tried setting my locale to "en_US.UTF-8" but it seems this isn't supported on Windows. I also tried Sys.setlocale("LC_CTYPE", "arabic") but it didn't help.

I know how to convert the text in the table to HTML numeric character references (like &#xxxx;), but this makes the HTML file hard to read and edit.
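For reference, the character-reference workaround mentioned above can be sketched roughly as follows. This is only an illustration: the helper name to_html_entities is made up for this example, and the input is written with a \u escape so that the script itself stays ASCII.

```r
# Hypothetical helper (not part of knitr): replace every non-ASCII
# character with an HTML numeric character reference such as &#x628;
to_html_entities <- function(x) {
  vapply(x, function(s) {
    cps <- utf8ToInt(enc2utf8(s))              # Unicode code points
    chars <- intToUtf8(cps, multiple = TRUE)   # one string per character
    paste0(ifelse(cps > 127, sprintf("&#x%X;", cps), chars), collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

to_html_entities("a\u0628b")  # "a&#x628;b"
```

The result would then have to be passed to kable with escape = FALSE, otherwise the ampersands are escaped a second time.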

Is there a good solution for this? Or is it better to use a non-Windows system when working with UTF-8?

1 Answer

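I found a possible workaround: search kable's output for the escaped code points (&lt;U+xxxx&gt;), replace each one with the character it encodes, and write the result to a file as raw UTF-8 bytes. I'm still hoping for a better solution.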

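library(knitr)
library(dplyr)  # only needed for the %>% pipe

# Replace every "&lt;U+XXXX&gt;" escape in x with the character it encodes.
rep_utf8 = function(x){
  # Replace the first escape found; return NULL when none are left.
  rep_one = function(x){
    pattern = "&lt;U\\+([A-F0-9]+)&gt;"
    a1 = regexpr(pattern,x,perl=TRUE)
    if(a1==-1) return(NULL)
    a2 = attr(a1,"match.length")    # length of the whole escape
    a3 = attr(a1,"capture.start")   # position of the first hex digit
    a4 = attr(a1,"capture.length")  # number of hex digits
    j = strtoi(substr(x,a3,a3+a4-1),base=16L)
    a = intToUtf8(j)
    paste0(
      substr(x,1,a1-1),
      a,
      substr(x,a1+a2,nchar(x))
    )
  }
  # Keep replacing until no escape remains.
  while(TRUE){
    x2 = rep_one(x)
    if(is.null(x2)) break
    x = x2
  }
  x
}

# Write the string as raw UTF-8 bytes so the Windows locale cannot re-encode it.
writeUtf8 = function(a,filename){
  con = file(filename,"wb")
  on.exit(close(con))
  writeBin(charToRaw(enc2utf8(a)), con)
}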

a = "ب"
kable(a,format="html") %>%
  rep_utf8() %>%
  writeUtf8("test.html")

test.html then contains:

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> x </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> ب </td>
  </tr>
</tbody>
</table>
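
The same replacement can probably be done in a single pass with base R's gregexpr() and regmatches<-, instead of looping over one match at a time. The following is a minimal sketch under that assumption; decode_u_escapes is a hypothetical name, and the function drops the knitr_kable class by converting its input to a plain character vector.

```r
# Hypothetical alternative to rep_utf8(): decode all "&lt;U+XXXX&gt;"
# escapes in one pass using base R only.
decode_u_escapes <- function(x) {
  x <- as.character(x)  # drop the knitr_kable class and attributes
  pattern <- "&lt;U\\+([0-9A-F]+)&gt;"
  m <- gregexpr(pattern, x)
  # For each element of x, turn its matched escapes into characters.
  regmatches(x, m) <- lapply(regmatches(x, m), function(hits) {
    hex <- sub(pattern, "\\1", hits)  # keep only the hex digits
    intToUtf8(strtoi(hex, base = 16L), multiple = TRUE)
  })
  x
}

decode_u_escapes("<td> &lt;U+0628&gt; </td>")  # "<td> \u0628 </td>"
```

It could replace rep_utf8() in the pipeline above, with the output still written through writeUtf8() so that the bytes on disk are UTF-8.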