
What's the preferred way in R to convert a character (vector) containing non-ASCII characters to HTML? For example, I would like to convert

  "ü"

to

  "ü"

I am aware that this is possible with a clever use of gsub (but has anyone done it once and for all?), and I thought the package R2HTML would do that, but it doesn't.

EDIT: Here is what I ended up using; it can obviously be extended by modifying the dictionary:

char2html <- function(x){
  # lookup table: each special character and its named HTML entity
  dictionary <- data.frame(
    symbol = c("ä","ö","ü","Ä", "Ö", "Ü", "ß"),
    html = c("&auml;","&ouml;", "&uuml;","&Auml;",
             "&Ouml;", "&Uuml;","&szlig;"),
    stringsAsFactors = FALSE)
  # replace every occurrence of each symbol with its entity
  for(i in seq_len(nrow(dictionary))){
    x <- gsub(dictionary$symbol[i], dictionary$html[i], x)
  }
  x
}

x <- c("Buschwindröschen", "Weißdorn")
char2html(x)
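For reference, this call should return the entity-encoded strings:

## [1] "Buschwindr&ouml;schen" "Wei&szlig;dorn"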
Philipp
  • It sounds like this: http://stackoverflow.com/questions/5060076/convert-html-character-entity-encoding-in-r might point you in the right direction. – Phil Nelson Oct 22 '12 at 22:44
  • Yepp, that's the other way round :) I just checked the XML package: it has a `toHTML` function, but that does not solve the above question. It seems such a basic thing to do: every WYSIWYG html editor can do that. – Philipp Oct 23 '12 at 05:15
  • Just out of curiosity: Why do you still need that in ages of UTF-8? – feeela Oct 23 '12 at 09:24
  • I am using a kind of content management system that only allows me to supply "pure" html, but some of my data is in UTF-8. – Philipp Oct 23 '12 at 13:25
  • Is there an easy way to do the opposite of what your function is currently doing? I'd like to convert `"&uuml;"` to its respective character. Thoughts? – petergensler Mar 10 '17 at 17:18
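Regarding that last comment: one way to go in the opposite direction (entity to character) is to let an HTML parser do the unescaping. This is only a sketch; the helper name is made up and it assumes the xml2 package is installed:

# hypothetical helper: decode HTML entities by wrapping the string in a
# dummy tag and letting xml2 parse it back into plain text
unescape_html <- function(x){
  vapply(x, function(s){
    xml2::xml_text(xml2::read_html(paste0("<x>", s, "</x>")))
  }, character(1), USE.NAMES = FALSE)
}

unescape_html("Wei&szlig;dorn")
## [1] "Weißdorn"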

2 Answers


This question is pretty old, but I couldn't find any straightforward answer... So I came up with this simple function, which uses numerical HTML codes and works for the Latin-1 Supplement block (integer values 161 to 255). There's probably (certainly?) a function in some package that does it more thoroughly, but what follows is probably good enough for many applications...

conv_latinsupp <- function(...) {
  out <- character()
  for (s in list(...)) {
    # split into single characters and get the Unicode code point of each
    splitted <- unlist(strsplit(s, ""))
    intvalues <- utf8ToInt(enc2utf8(s))
    # replace characters in the Latin-1 Supplement range (161-255)
    # with their numeric entities, e.g. "é" becomes "&#0233;"
    pos_to_modify <- which(intvalues >= 161 & intvalues <= 255)
    splitted[pos_to_modify] <- paste0("&#0", intvalues[pos_to_modify], ";")
    out <- c(out, paste0(splitted, collapse = ""))
  }
  out
}

conv_latinsupp("aeiou", "àéïôù12345")
## [1] "aeiou"   "&#0224;&#0233;&#0239;&#0244;&#0249;12345"
Dominic Comtois

The XML package has a function insertEntities for this, but that function is internal. So you may use it at your own risk, as there is no guarantee that it will keep operating like this in future versions.

Right now, what your code does could be accomplished using

char2html <- function(x) XML:::insertEntities(x, c("ä"="auml", "ö"="ouml", …))

The use of a named vector instead of a data.frame feels somewhat more elegant, but it doesn't change the core of things: under the hood, insertEntities calls gsub in much the same way your code does.

If numeric HTML entities are valid in your environment, then you could probably convert all your text into those using utf8ToInt and then turn the safely printable ASCII characters back into unescaped form, as sketched below. This would save you the trouble of maintaining a dictionary for your entities.
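A minimal sketch of that idea, assuming it is enough to leave plain ASCII (code points up to 127) untouched and escape everything else; the helper name is made up:

# hypothetical helper: escape every non-ASCII character as a numeric entity
to_numeric_entities <- function(x) {
  vapply(x, function(s) {
    chars <- strsplit(s, "")[[1]]       # individual characters
    codes <- utf8ToInt(enc2utf8(s))     # their Unicode code points
    nonascii <- codes > 127             # plain ASCII stays as it is
    chars[nonascii] <- sprintf("&#%d;", codes[nonascii])
    paste0(chars, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

to_numeric_entities(c("Buschwindröschen", "Weißdorn"))
## [1] "Buschwindr&#246;schen" "Wei&#223;dorn"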

MvG