
When I read a particular Spanish website, the accented characters come back as HTML entities. I read the site with the readLines function (I need to use this function).

url <- "http://www.senamhi.gob.pe/include_mapas/_map_data_hist03.php?drEsta=01"
char_data <- readLines(url,encoding="UTF-8")

After all the processing needed to extract my data, I end up with a data frame containing a character variable whose values are words with accents. It looks something like this:

var <- rep("Meteorol&oacute;gica",5)

I need to convert the Spanish accents from HTML entities to normal accented characters. I tried the iconv function:

iconv(var, "UTF-8", "ASCII")

But it doesn't work: I get back the same character vector I passed in. I also tried changing the encoding option in readLines, but that doesn't work either.

How can I do this? Thanks.

arvi1000
pescobar
  • It's more likely that we will be able to help you if you make a minimal reproducible example to go along with your question. Something we can work from and use to show you how it might be possible to solve your problem. You can have a look at [this SO post](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on how to make a great reproducible example in R. – Eric Fail Dec 29 '15 at 18:17

2 Answers


I don't know R, but if you can include one line of javascript in it, this is the line:

var encoded = 'H&oacute;la';
var notEncoded = encoded.replace(/&oacute;/g, "ó"); // acute accent; the /g flag replaces every occurrence

Then, get the notEncoded value in your .R program.

jack
  • Thanks, but that code does not help me. I know how to replace part of a text in R, but I need a general function because I have to handle other words with other accented vowels. – pescobar Dec 29 '15 at 19:34

Why not look up all the HTML &codes; for accented characters then find/replace?

library(rvest)

# scrape lookup table of accented char html codes, from the 2nd table on this page
ref_url <- 'http://www.w3schools.com/charsets/ref_html_8859.asp'
# read_html() replaces the deprecated rvest::html()
char_table <- read_html(ref_url) %>% html_table %>% `[[`(2)
# fix names
names(char_table) <- names(char_table) %>% tolower %>% gsub(' ', '_', .)

# here's a test string loaded with different html accents
test_str <- '&Agrave; &Aacute; &Acirc; &Atilde; &Auml; &Aring; &AElig; &Ccedil; &Egrave; &Eacute; &Ecirc; &Euml; &Igrave; &Iacute; &Icirc; &Iuml; &ETH; &Ntilde; &Ograve; &Oacute; &Ocirc; &Otilde; &Ouml; &times; &Oslash; &Ugrave; &Uacute; &Ucirc; &Uuml; &Yacute; &THORN; &szlig; &agrave; &aacute; &acirc; &atilde; &auml; &aring; &aelig; &ccedil; &egrave; &eacute; &ecirc; &euml; &igrave; &iacute; &icirc; &iuml; &eth; &ntilde; &ograve; &oacute; &ocirc; &otilde; &ouml; &divide; &oslash; &ugrave; &uacute; &ucirc; &uuml; &yacute; &thorn; &yuml;'

# use mgsub from here (it's just gsub with a for loop)
# http://stackoverflow.com/questions/15253954/replace-multiple-arguments-with-gsub
mgsub(char_table$entity_name, char_table$character, test_str)

And voilà, there you are:

"À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ"
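The `mgsub` call above relies on a helper that is only linked, not shown. As a rough sketch of what "gsub with a for loop" could look like, here is one possible definition, applied to the `var` vector from the question. The function body, the small hand-written lookup vectors `entities` and `chars`, and the name `decoded` are all assumptions made for illustration; they are not the code from the linked post, and the table only covers common Spanish entities rather than the full scraped list.

```r
# Hypothetical helper: run gsub() once per (pattern, replacement) pair.
# Argument order matches the call above: patterns, replacements, input.
mgsub <- function(pattern, replacement, x) {
  stopifnot(length(pattern) == length(replacement))
  for (i in seq_along(pattern)) {
    # fixed = TRUE treats each entity as a literal string, not a regex
    x <- gsub(pattern[i], replacement[i], x, fixed = TRUE)
  }
  x
}

# Assumed lookup table, limited to common Spanish entities.
# Replacements use \u escapes so the script does not depend on file encoding.
entities <- c("&aacute;", "&eacute;", "&iacute;", "&oacute;", "&uacute;",
              "&ntilde;", "&uuml;",
              "&Aacute;", "&Eacute;", "&Iacute;", "&Oacute;", "&Uacute;",
              "&Ntilde;", "&Uuml;")
chars <- c("\u00e1", "\u00e9", "\u00ed", "\u00f3", "\u00fa",
           "\u00f1", "\u00fc",
           "\u00c1", "\u00c9", "\u00cd", "\u00d3", "\u00da",
           "\u00d1", "\u00dc")

# The vector from the question
var <- rep("Meteorol&oacute;gica", 5)
decoded <- mgsub(entities, chars, var)
print(decoded)
```

The same call should work on a data frame column, for example `df$name <- mgsub(entities, chars, df$name)`, where `df` and `name` are placeholder names. Entities missing from the lookup table are left untouched.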

arvi1000