Characters debugging based on code points

Question

I have a character vector, pulled from it.dbpedia.org, that contains multiple encoding errors. Every accented character is rendered incorrectly, for example "\"Democrazia Ã¨ LibertÃ - La Margherita\"@it" instead of "\"Democrazia è Libertà - La Margherita\"@it".

I found a debugging map for this kind of encoding problem here. However, I noticed that the relation between "actual" and "expected" characters is not one-to-one (as I would expect) but one-to-many. For example, my character "Ã" might translate as any of "Á", "Í", "Ï", "Ð", "Ý", "à". In other words, I can't use a pattern/replacement solution that maps actual characters to expected characters.
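A short base R sketch of where the one-to-many impression may come from, assuming (as the comments below suggest, but nothing here confirms) that the text is UTF-8 that was decoded with a single-byte encoding. Each of the six candidate characters is two bytes in UTF-8. The first byte is always 0xC3, which latin-1 shows as "Ã". The second byte is what tells them apart, but in Windows-1252 the bytes 81, 8D, 8F, 90 and 9D are unassigned and A0 is a no-break space, so none of them is visible. The variable names are illustrative only.

```r
# The six characters the question lists as possible meanings of a lone "Ã",
# written as \u escapes so the script does not depend on the file's encoding.
candidates <- c("\u00c1", "\u00cd", "\u00cf", "\u00d0", "\u00dd", "\u00e0")

# UTF-8 encodes each of them as two bytes.
utf8_bytes <- lapply(enc2utf8(candidates), charToRaw)

first_byte  <- vapply(utf8_bytes, function(b) as.integer(b[1]), integer(1))
second_byte <- vapply(utf8_bytes, function(b) as.integer(b[2]), integer(1))

# Every first byte is 0xC3, which a latin-1 reader displays as "Ã".
print(sprintf("%02X", first_byte))
# The second bytes all differ, but none maps to a visible character.
print(sprintf("%02X", second_byte))
```

Seen this way the mapping is one-to-one after all, provided the invisible second character is still present in the string. If it was stripped somewhere along the way, the lone "Ã" really is ambiguous.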

Can I use a pattern/replacement solution that maps Unicode code points to expected characters? How do I pass a Unicode code point to gsub() instead of the actual character?
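For illustration, a minimal sketch of passing code points to gsub() through R's `\uXXXX` string escapes. The sample string and the two-entry lookup are assumptions made up for this example: the sample assumes the character after "Ã" in "LibertÃ" is a no-break space (U+00A0), which may not hold for the real data.

```r
# Sample garbled string, written with escapes only.
garbled <- "Democrazia \u00c3\u00a8 Libert\u00c3\u00a0 - La Margherita"

# Hypothetical lookup: garbled two-character sequences and their intended
# single characters, all given as code-point escapes.
patterns     <- c("\u00c3\u00a8", "\u00c3\u00a0")  # "Ã¨", "Ã" + no-break space
replacements <- c("\u00e8",       "\u00e0")        # "è",  "à"

repaired <- garbled
for (i in seq_along(patterns)) {
  # fixed = TRUE matches the pattern literally instead of as a regex.
  repaired <- gsub(patterns[i], replacements[i], repaired, fixed = TRUE)
}
print(repaired)
```

A table like this has to list every garbled sequence that can occur, so it only seems practical when the set of affected characters is small and known in advance.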

Should I instead use a package such as stringi to solve the encoding issue? If so, how?

UPDATE: I just noticed the problem is at the source: the XML output of SPARQL.

NOTE: Related to this unanswered question.

Are you sure those are coding errors? It looks more like the source string is UTF-8 encoded but was read using a single-byte encoding like latin-1. — Tim Pietzcker, Feb 27 '16 at 08:36
@TimPietzcker I used the `SPARQL` package to pull the data from DBpedia. It seems I can't set any encoding in the SPARQL() request, which parses the XML output of DBpedia. Is there any way to solve the issue *after* the text has been downloaded into an R object? — CptNemo, Feb 27 '16 at 08:41
Very strange since XML is most commonly UTF-8 encoded; you'd think that SPARQL would observe that, but the unanswered question you linked to seems to corroborate the evidence that it doesn't. I don't know R, so I can't say how you could proceed - does R have Unicode objects as opposed to byte strings that you could decode the string into using UTF-8? — Tim Pietzcker, Feb 27 '16 at 09:02
Match the encoding and you should be fine. It's a big complicated world with no real cover-to-cover manual. Good luck. :) — Roman Luštrik, Feb 27 '16 at 09:32
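
If the diagnosis in the comments holds (UTF-8 bytes that were decoded as latin-1), the damage can in principle be undone after download by reversing that step. The following is a base R sketch under that assumption, not a confirmed fix for the `SPARQL` package. `fix_mojibake` is a hypothetical helper name, and the sample string is made up, again assuming a no-break space after "Ã" in "LibertÃ".

```r
# Hypothetical helper: undo "UTF-8 read as latin-1" on a character vector.
fix_mojibake <- function(x) {
  # Step 1: turn each character back into the single byte it was decoded from.
  bytes <- iconv(x, from = "UTF-8", to = "latin1")
  # Step 2: declare that those bytes are in fact UTF-8.
  Encoding(bytes) <- "UTF-8"
  bytes
}

# Sample garbled string, written with escapes only.
garbled <- "Democrazia \u00c3\u00a8 Libert\u00c3\u00a0 - La Margherita"
fixed <- fix_mojibake(garbled)
print(fixed)
```

iconv() returns NA for elements it cannot convert, so NA values in the result would indicate strings containing characters outside latin-1. In that case the original misreading may have used Windows-1252 rather than latin-1, and a different `to` value would be needed.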

