I've come across yet another strange interaction between R's data frames and Unicode characters on Windows, this time involving knitr and rmarkdown.

Implicit printing works correctly

I'm trying to print an HTML table based on a data frame containing Unicode characters, as represented by this simple example:

---
title: "Unicode Print Test"
---

```{r, results='asis'}
library(knitr)
knitr::kable(data.frame(eta="\U03B7"), format="html")
```

This produces the output I want when the document is knitted to HTML: the table cell shows η.

Explicit printing does not

But in the real application, I need to print several tables from inside a for loop, meaning I have to explicitly print() the table:

```{r, results='asis'}
library(knitr)
x <- knitr::kable(data.frame(eta="\U03B7"), format="html")
print(x)
```

Now, the Unicode character is not printed correctly when the document is knitted to HTML: the table cell shows <U+03B7> instead of η.

What to do?

Why does this difference between implicit and explicit printing occur? At least when executed in the R console, both explicit and implicit printing call the knitr:::print.knitr_kable() function. I would guess it has something to do with the evaluate function (from the package of the same name), which actually executes the code in knitr code chunks, but I can't figure out what.

Is there any way I can have my explicit print() calls and get the correctly formatted output? I am aware of this locale workaround which seems to work for some other Unicode + Data Frame issues, but not this one.

EDIT: According to a knowledgeable commenter, this is a deep issue related to how and when R converts characters using the native Windows encoding prior to display. So, this is always going to be an issue when using the print function when on Windows, unless base R changes significantly.

Updated question: Are there any other methods (besides print()-ing) for getting a kable object to display from inside expressions such as for-loops?

sessionInfo()

## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.20
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.5.1  backports_1.1.2 magrittr_1.5    rprojroot_1.3-2
##  [5] tools_3.5.1     htmltools_0.3.6 yaml_2.2.0      Rcpp_0.12.18   
##  [9] stringi_1.1.7   rmarkdown_1.10  highr_0.7       stringr_1.3.1  
## [13] digest_0.6.16   evaluate_0.11
whopper510
  • No, it is unlikely that the explicit printing will work in this case. More technical background: https://github.com/r-lib/evaluate/issues/59 (In short, if the character is not supported by your Windows native encoding, it will be destroyed in explicit printing.) – Yihui Xie Aug 16 '19 at 03:27
  • Thanks, I had seen that issue and was afraid that was the case. Just surprised to learn that implicit and explicit printing can have such different behavior. I suppose there are no other clever hacks for getting a `kable` object to display from inside a non-top-level expression? – whopper510 Aug 16 '19 at 14:11
  • If that is your actual question, I think there should be a way to achieve your goal. You can store the output of `kable()` in a character vector, and implicitly print the vector via `knitr::asis_output(paste(your_vector, collapse = '\n'))` as a top-level expression later. – Yihui Xie Aug 16 '19 at 14:19
  • @whopper510 where do you print the output? You may not have to do *anything*. `<U+03B7>` is just the way R or RStudio display Unicode characters, not what is actually printed to the console or a file. You can configure RStudio to display the Unicode text instead of the escape sequence – Panagiotis Kanavos Aug 16 '19 at 14:23
  • @PanagiotisKanavos It's being "printed" as the output in a HTML document created with knitr + rmarkdown. Unfortunately, the result is the same whether the source document uses the escape sequence, or the literal η character. – whopper510 Aug 16 '19 at 14:36
  • @YihuiXie Excellent Idea! Yes, if you accumulate each `kable` object into a list, then collapse the list into a character vector and display it with knitr::asis_output, this works! Unfortunately for me, I was also producing plots inside the `for` loop, and converting `ggplot` objects to character vectors doesn't work so well... – whopper510 Aug 16 '19 at 14:49
  • @whopper510 In that case (mixing tables and plots), you are right that this solution won't work. There is a brick wall in base R that I can't pass, unfortunately. I have talked to an R core member but the change is unlikely to happen. Perhaps all we can do is wait for Windows to officially support UTF-8... – Yihui Xie Aug 16 '19 at 15:34
  • Actually, if you don't mind hacking, you could use a **knitr** output hook to substitute the `<U+XXXX>` sequences with the actual Unicode characters. – Yihui Xie Aug 16 '19 at 15:36
  • So, override the output hook with `knit_hooks$set("output" = function(x, options) {...})`? OK, I'll give it a shot. I think since I'm targeting HTML I'll try replacing them with the HTML entities. – whopper510 Aug 16 '19 at 15:54

1 Answer

So, apparently this character conversion issue is unlikely to resolve itself in the near future, and will probably only be solved at the OS level. But based on the excellent suggestions made by @YihuiXie in the comments, there are two ways this issue can be worked around. The best solution will depend on the context that you are creating the tables in.

Scenario 1: Tables Only

If tables are the only type of object you need to output from inside your for-loop, then you can accumulate the kable objects in a list inside the loop, collapse the list of kables into a single character vector at the conclusion of the loop, and display it using knitr::asis_output.

```{r, results="asis"}
library(knitr)
character_list <- list(eta="\U03B7", sigma="\U03C3")
kable_list <- vector(mode="list", length = length(character_list))

for (i in seq_along(character_list)) {
  kable_list[[i]] <- knitr::kable(as.data.frame(character_list[i]),
                                  format="html"
                                  )
}

knitr::asis_output(paste(kable_list, collapse = '\n'))
```

This produces tables in the HTML document with η and σ rendered correctly.

Scenario 2: Tables and other objects (e.g. Plots)

If you're outputting both tables and other objects (e.g., plots) on each iteration of your for-loop, then the above solution won't work: you can't coerce your plots to a character vector! At this point, we have to resort to some post-processing of the kable output by writing a customized knitr output hook.

The basic approach will be to replace the busted <U+XXXX> sequences in the table cells with the equivalent HTML entities. Note that because the table is created in a results="asis" chunk, we have to override the chunk-level output hook, not the output-level output hook (confusing, I know).
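The substitution itself can be tried in isolation in the console before wiring it into a hook. This is only a minimal sketch: the helper name `u_escapes_to_entities` is made up for illustration, and it assumes the escapes always look like <U+XXXX> with 4 to 8 uppercase hex digits.

```r
# Hypothetical helper: rewrite R's <U+XXXX> escapes as hexadecimal HTML entities.
# The "x" after "&#" marks the number as hexadecimal; leading zeros are harmless.
u_escapes_to_entities <- function(x) {
  gsub("<U\\+([0-9A-F]{4,8})>", "&#x\\1;", x)
}

html <- "<td> <U+03B7> </td>"
u_escapes_to_entities(html)
# "<td> &#x03B7; </td>"
```

The hook below does the same replacement on the chunk output, using regmatches() instead of a backreference.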

```{r hook_override}
library(knitr)
default_hook <- knit_hooks$get("chunk")

knit_hooks$set(chunk = function(x, options) {
  # only attempt substitution if output is a character vector, which I *think* it always should be 
  if (is.character(x)) {
    # Match the <U+XXXX> pattern in the output
    match_data <- gregexpr("<U\\+[0-9A-F]{4,8}>", x)
    # If a match is found, proceed with HTML entity substitution
    if (length(match_data[[1]]) >= 1 && match_data[[1]][1] != -1) {
      # Extract the matched strings from the output
      match_strings <- unlist(regmatches(x, match_data))
      # Extract the hexadecimal Unicode sequences from inside the <U > bracketing
      code_sequences <- unlist(regmatches(match_strings,
                                          gregexpr("[0-9A-F]{4,8}", match_strings)
                                          )
                               )
      # Prefix every code with x to mark it as a hexadecimal HTML entity
      # (leading zeros are harmless; codes without them would otherwise be read as decimal)
      code_sequences <- paste0("x", code_sequences)
      # Slap the &# on the front, and the ; on the end of each code sequence
      regmatches(x, match_data) <- list(paste0("&#", code_sequences, ";"))
    }
  }
  # "Print" the output
  default_hook(x, options)
})
``` 

```{r tables, results="asis"}
character_list <- list(eta="\U03B7", sigma="\U03C3")  
for (i in seq_along(character_list)) {
  x <- knitr::kable(as.data.frame(character_list[i]),
                    format="html"
                    )
  print(x)
}
```

```{r hook_reset}
knit_hooks$set(chunk = default_hook)
```

This produces the tables in the HTML document, with η rendered correctly.

Note that this time, the sigma doesn't display as σ like it did in the first example; it displays as s! This is because the sigma gets converted to an s before it ever reaches the chunk output hook. I have no idea how to stop that from happening. Feel free to leave a comment if you do =)

I also realize that using regular expressions to do the substitutions within the HTML table is probably fragile. If this approach happens to fail for your use case, perhaps using the rvest package to parse out each table cell individually would be more robust.
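Another option that might avoid post-processing altogether is to encode the non-ASCII characters as HTML entities before the table is built. I offer it as a sketch rather than a confirmed fix. The idea is that the text reaching print() is then pure ASCII, so there should be nothing left for the native-encoding conversion to mangle. The helper name `to_html_entities` is made up for illustration; it uses only base R (`utf8ToInt`, `intToUtf8`, `sprintf`).

```r
# Hypothetical helper: replace every non-ASCII character in a character
# vector with its hexadecimal HTML entity, leaving ASCII text untouched.
to_html_entities <- function(x) {
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)
    cp <- utf8ToInt(enc2utf8(s))              # Unicode code points of the string
    chars <- intToUtf8(cp, multiple = TRUE)   # one element per character
    non_ascii <- cp > 127
    chars[non_ascii] <- sprintf("&#x%X;", cp[non_ascii])
    paste(chars, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

to_html_entities(c("\u03b7", "a\u03c3"))
# "&#x3B7;"  "a&#x3C3;"
```

To use it with kable, you would apply it to the character columns (create the data frame with `stringsAsFactors = FALSE` so they are not factors) and pass `escape = FALSE` so that kable leaves the `&` of each entity alone. Note that this also turns off escaping for any other HTML special characters in the table.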

whopper510