0

I have some HTML files (I'm treating them as plain text) that use decimal NCRs (numeric character references) to encode special characters. Is there a convenient way to convert them to Unicode using R?
NCR codes do not seem to match Unicode escapes one-to-one, which is quite confusing: &#1123; (ѣ) corresponds not to \u1123 but to \u0463:

> stri_unescape_unicode("\u1123")
[1] "ᄣ"

and

> stri_unescape_unicode("\u0463")
[1] "ѣ"
perechen
  • Have you seen [This earlier post](https://stackoverflow.com/q/29145456/4752675) ? – G5W Feb 02 '20 at 13:56

1 Answer

2

1123 is the decimal equivalent of hexadecimal 0463, and Unicode escape sequences are written in hexadecimal. So to convert, you need to strip out the non-digit characters, convert the resulting number to hexadecimal, put "\u" in front of it, and then call stri_unescape_unicode.
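As a quick sanity check of the decimal/hexadecimal relationship, here is a minimal base R sketch. The variable names are only illustrative and are not part of the original answer.

```r
code_point <- 1123L                              # the number inside &#1123;

# Decimal 1123 written in hexadecimal
hex_digits <- format(as.hexmode(code_point))     # "463"

# And back again: hexadecimal 0463 read as a decimal integer
back <- strtoi("0463", base = 16L)               # 1123

# Base R can also turn the decimal code point into a character directly
ch <- intToUtf8(code_point)                      # the same character as "\u0463"
```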

This function will do all that:

ncr2uni <- function(x)
{
  # Strip out non-digits and convert the remaining numbers to hex
  x <- as.hexmode(as.numeric(gsub("\\D", "", x)))

  # Left pad with zeros to length 4 so escape sequence is recognised as Unicode 
  x <- stringi::stri_pad_left(x, 4, "0")

  # convert to Unicode
  stringi::stri_unescape_unicode(paste0("\\u", x))
}

Now you can do

ncr2uni(c("&#1123;", "&#1124;", "&#1125;"))
# [1] "ѣ" "Ѥ" "ѥ"
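Since the question is about whole HTML files read as plain text, the NCRs will usually sit inside longer strings rather than being one per element. A minimal sketch for that case, assuming base R only, is shown below; `decode_ncr` is a hypothetical helper name, not something from stringi. It uses intToUtf8 rather than a "\u" escape, so it should also cope with code points above 65535, which the four-digit "\u" form cannot express.

```r
# Hypothetical helper: replace every decimal NCR (&#NNNN;) embedded in text
decode_ncr <- function(x) {
  # Locate all decimal NCRs in each element of x
  m <- gregexpr("&#[0-9]+;", x)

  # Swap each match for the character with that decimal code point
  regmatches(x, m) <- lapply(regmatches(x, m), function(codes) {
    nums <- as.integer(gsub("\\D", "", codes))   # keep only the digits
    vapply(nums, intToUtf8, character(1))        # one character per code point
  })
  x
}

decode_ncr("&#1123;&#1124; and &#1125;")
# [1] "ѣѤ and ѥ"
```

For a file, something like `decode_ncr(readLines(path, encoding = "UTF-8"))` should work, where `path` is a placeholder for your own file name. Hexadecimal NCRs of the form `&#x463;` are not handled by this sketch.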
Allan Cameron
  • Thanks a lot! I didn't realize Unicode is hexadecimal at all. Now I feel silly and enlightened at the same time. – perechen Feb 02 '20 at 23:19