I have a character vector with multiple encoding "errors", which I pulled from it.dbpedia.org. Each accented character is rendered incorrectly, like "\"Democrazia Ã¨ LibertÃ  - La Margherita\"@it" instead of "\"Democrazia è Libertà - La Margherita\"@it".
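To make the problem reproducible, here is a minimal sketch of how such a string can arise. It assumes the garbling comes from UTF-8 bytes being read as Latin-1, which is my guess rather than something I have confirmed for the DBpedia output; `expected` and `garbled` are illustrative names only.

```r
# The string as it should look, written with \u escapes so the
# example does not depend on the encoding of the script file.
expected <- "Democrazia \u00e8 Libert\u00e0 - La Margherita"

# Assumption: the UTF-8 bytes were misread as Latin-1.
# iconv() reinterprets each byte as one Latin-1 character
# and re-encodes the result as UTF-8.
garbled <- iconv(expected, from = "latin1", to = "UTF-8")

print(garbled)
```

Under that assumption, each accented character becomes a two-character sequence starting with "Ã".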
I found a debugging map for this kind of encoding problem here. Still, I noticed that the relation between "actual" and "expected" characters is not one-to-one (as I would expect) but one-to-many. For example, the character "Ã" might alternatively translate to "Á", "Í", "Ï", "Ð", "Ý" or "à". In other words, I can't use a simple pattern/replacement solution that maps each actual character to one expected character.
Can I use a pattern/replacement solution that maps Unicode code points to the expected characters? How do I pass a Unicode code point to gsub() instead of the literal character?
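What I have in mind is roughly the following sketch. It relies on R's `\uXXXX` escapes in string literals and matches the full two-character garbled sequence rather than "Ã" alone. The vectors `patterns` and `replacements` are hypothetical, cover only the two sequences in my example, and I am not sure this approach scales to the whole map.

```r
# Garbled input, written with \u escapes (U+00C3 is "Ã").
garbled <- "Democrazia \u00c3\u00a8 Libert\u00c3\u00a0 - La Margherita"

# Hypothetical lookup: garbled two-character sequence -> expected character.
patterns     <- c("\u00c3\u00a8", "\u00c3\u00a0")  # "Ã¨", "Ã" + no-break space
replacements <- c("\u00e8",       "\u00e0")        # "è",  "à"

repaired <- garbled
for (i in seq_along(patterns)) {
  # fixed = TRUE treats the pattern as a literal string, not a regex
  repaired <- gsub(patterns[i], replacements[i], repaired, fixed = TRUE)
}

print(repaired)
```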
Should I instead use a package such as stringi to solve the encoding issue? If so, how?
UPDATE: I just noticed the problem is at the source: the XML output of SPARQL.
NOTE: Related to this unanswered question.