I have a character vector, pulled from it.dbpedia.org, with multiple encoding "errors". Every accented character is rendered incorrectly: I get "\"Democrazia Ã¨ LibertÃ  - La Margherita\"@it" instead of "\"Democrazia è Libertà - La Margherita\"@it".

I found a debugging map for this kind of encoding problem here. Still, I noticed that the relation between "actual" and "expected" characters is not one-to-one (as I would expect) but one-to-many. For example, the character "Ã" might translate as "Á", "Í", "Ï", "Ð", "Ý", or "à". In other words, I can't use a pattern/replacement solution that maps actual characters to expected characters.

Can I use a pattern/replacement solution that maps Unicode code points to expected characters? How do I pass a Unicode code point to gsub() instead of the actual character?

Should I instead use a package such as stringi to solve the encoding issue? If so, how?

UPDATE: I just noticed the problem is at the source: the XML output of SPARQL.

NOTE: Related to this unanswered question.

CptNemo
  • Are you sure those are coding errors? It looks more like the source string is UTF-8 encoded but was read using a single-byte encoding like latin-1. – Tim Pietzcker Feb 27 '16 at 08:36
  • @TimPietzcker I used the `SPARQL` package to pull the data from DBpedia. It seems I can't set any encoding in the SPARQL() request, which parses the XML output of DBpedia. Is there any way to solve the issue *after* the text has been downloaded into an R object? – CptNemo Feb 27 '16 at 08:41
  • Very strange since XML is most commonly UTF-8 encoded; you'd think that SPARQL would observe that, but the unanswered question you linked to seems to corroborate the evidence that it doesn't. I don't know R, so I can't say how you could proceed - does R have Unicode objects as opposed to byte strings that you could decode the string into using UTF-8? – Tim Pietzcker Feb 27 '16 at 09:02
  • Match the encoding and you should be fine. It's a big complicated world with no real cover-to-cover manual. Good luck. :) – Roman Luštrik Feb 27 '16 at 09:32
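
The comments point to UTF-8 bytes that were read as a single-byte encoding. A minimal base R sketch of undoing that after download follows, assuming the bytes were misread as latin-1 (not confirmed in the question; if it was Windows-1252, the `to` argument would need to change). The function name `fix_mojibake` is hypothetical, and the sample string is built from `\u` escapes so that it does not depend on how this page or a script file is encoded.

```r
# Hypothetical helper: reverse "UTF-8 bytes read as latin-1" mojibake.
# Assumption: the wrong decoding step used latin-1.
fix_mojibake <- function(x) {
  # Re-encode to latin-1: each wrongly decoded character becomes the
  # single original byte again (e.g. "Ã¨" becomes bytes C3 A8).
  bytes <- iconv(x, from = "UTF-8", to = "latin1")
  # Declare that those bytes are UTF-8; no bytes are changed here.
  Encoding(bytes) <- "UTF-8"
  bytes
}

# Mojibake sample: "Ã¨" is U+00C3 U+00A8, "Ã" plus a no-break space is U+00C3 U+00A0.
broken <- "Democrazia \u00c3\u00a8 Libert\u00c3\u00a0 - La Margherita"
fixed <- fix_mojibake(broken)
print(fixed)

# A pattern/replacement route is also possible, because each broken character
# is really a pair of code points; "\u" escapes pass them to gsub().
partial <- gsub("\u00c3\u00a8", "\u00e8", broken, fixed = TRUE)
```

The one-to-many mapping in the debugging map would be consistent with this: "Ã" is only the first half of each pair, and the second half is often invisible or a control character. Converting the whole string avoids listing every pair by hand; the `gsub()` line above repairs only "è".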

0 Answers