1

I need to strip accents and HTML accent codes from an HTML string. I have found a lot of code that does this, but none of it seems to work with the file I need to clean.

This file contains words like Postulación Ayudantías, and also Gestión or Árbol.

I found a lot of code that uses text.normalize and regular expressions to clean the string. It works well with short strings, but I'm working with very long strings, and it doesn't work on those.

I am really lost here and I need help please!

These are the approaches I tried that didn't work:

Easy way to remove UTF-8 accents from a string? (returns "?" for every accent in the string)

I also used replaceAll to remove the HTML accent codes, but neither approach is working:

string=string.replaceAll("á","a");
string=string.replaceAll("é","e");
string=string.replaceAll("í","i");
string=string.replaceAll("ó","o");
string=string.replaceAll("ú","u");
string=string.replaceAll("ñ","n");

Edit: never mind, the replaceAll is working. I had written it wrong ("/á instead of "á).

Any help or ideas?

Benjamin Jimenez

2 Answers

1

I think there are several options that would work. I would suggest that you first use StringEscapeUtils.unescapeHtml4(String) to unescape your HTML entities (that is, convert them to the ordinary Unicode characters in a Java String). Then you could use Lucene's ASCIIFoldingFilter to fold those characters to their ASCII equivalents.
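The two steps above (unescape first, then fold to ASCII) can be sketched with the JDK alone. This is only an illustration under stated assumptions: the class name AccentCleaner and its tiny entity table are hypothetical stand-ins for StringEscapeUtils.unescapeHtml4, and java.text.Normalizer is swapped in for ASCIIFoldingFilter, so it only handles characters that decompose into a base letter plus combining marks.

```java
import java.text.Normalizer;
import java.util.Map;

// Hypothetical helper, not part of Commons Lang or Lucene.
final class AccentCleaner {
    // Deliberately tiny entity table (an assumption for this sketch);
    // a real unescaper covers every named entity.
    private static final Map<String, String> ENTITIES = Map.of(
            "&aacute;", "\u00e1", "&eacute;", "\u00e9", "&iacute;", "\u00ed",
            "&oacute;", "\u00f3", "&uacute;", "\u00fa", "&ntilde;", "\u00f1",
            "&Aacute;", "\u00c1");

    // Step 1: turn the known entities into the characters they stand for.
    static String unescape(String html) {
        String out = html;
        for (Map.Entry<String, String> e : ENTITIES.entrySet()) {
            out = out.replace(e.getKey(), e.getValue());
        }
        return out;
    }

    // Step 2: decompose accented letters (NFD) and drop the combining marks.
    static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    static String clean(String html) {
        return stripAccents(unescape(html));
    }

    public static void main(String[] args) {
        // Mixes entity-encoded and literal accents, as in the SQL file described.
        String input = "Postulaci&oacute;n Ayudant&iacute;as, Gesti\u00f3n, &Aacute;rbol";
        System.out.println(clean(input)); // Postulacion Ayudantias, Gestion, Arbol
    }
}
```

Neither step depends on the length of the string, so if this works on a short string but not on the real file, the way the file is being read is worth checking.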

Elliott Frisch
  • Though it wasn't all I needed, it helped me realize my mistake: I was reading the SQL file without using UTF-8, so I was unable to parse it correctly – Benjamin Jimenez Dec 08 '13 at 04:40
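The encoding mistake described in the comment above can be reproduced in a few lines. This is a minimal sketch, assuming the file is stored as UTF-8; the class name CharsetDemo is hypothetical, and US-ASCII is used only as an example of a wrong charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: the same bytes decoded with two different charsets.
final class CharsetDemo {
    static String decode(byte[] bytes, Charset charset) {
        // new String(byte[], Charset) substitutes a replacement character
        // for every byte sequence the charset cannot decode.
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The bytes a UTF-8 file would contain for "Gestión".
        byte[] fileBytes = "Gesti\u00f3n".getBytes(StandardCharsets.UTF_8);

        String right = decode(fileBytes, StandardCharsets.UTF_8);
        String wrong = decode(fileBytes, StandardCharsets.US_ASCII);

        System.out.println(right.equals("Gesti\u00f3n")); // true
        System.out.println(wrong.equals("Gesti\u00f3n")); // false: the accent is lost
    }
}
```

Once the accent has been lost at read time, no later normalize or replaceAll call can recover it, so the charset should be passed explicitly when reading, for example with Files.readString(path, StandardCharsets.UTF_8) on Java 11 or later.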
0

You need to differentiate whether you're talking about a whole HTML document containing tags and so forth or just a string containing HTML encoded data.

If you're working with an entire HTML document, say, something returned by fetching a web page, then the solution is really more than could fit into a Stack Overflow answer, since you basically need an HTML parser to navigate the data.

However, if you're just dealing with a string that's HTML encoded, then you first need to decode it. There are lots of utilities to do so, such as the Apache Commons Lang library StringEscapeUtils class. See this question for an example.

Once you've decoded the string, you need to iterate over it character by character and replace anything that's unwanted. Your current method won't work for hex-encoded items, and you would end up having to build a huge table to cover all the possible HTML entities.
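The hex-encoded items are the ones a replaceAll table tends to miss. As a rough sketch (the class name NumericEntityDecoder is hypothetical, and a library such as Commons Lang already handles this for you), decimal and hexadecimal character references can be decoded with one regular expression:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes &#243; and &#xF3; style references only.
final class NumericEntityDecoder {
    // Group 1 captures hex digits, group 2 captures decimal digits.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decode(String text) {
        Matcher m = NUMERIC_REF.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one.
            out.append(text, last, m.start());
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            if (Character.isValidCodePoint(codePoint)) {
                out.appendCodePoint(codePoint);
            } else {
                out.append(m.group()); // leave invalid references untouched
            }
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Decodes to "Gestión Árbol".
        System.out.println(decode("Gesti&#243;n &#xC1;rbol"));
    }
}
```

Named entities such as &oacute; still need a lookup table, which is the main reason to prefer an existing unescaping utility over hand-written replacements.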

Jherico
  • Actually, I'm working with an SQL insert that contains HTML code in one of its fields, so the SQL fields use normal accents like ó and the HTML fields use accent codes (&acute) – Benjamin Jimenez Dec 08 '13 at 03:21
  • I don't know what you mean by SQL fields vs HTML fields... Nor do I understand why you can't simply leave the HTML as it is. If the database field is supposed to contain HTML, then it's not your responsibility to decode the HTML. The eventual renderer will do that. – Jherico Dec 08 '13 at 03:26
  • @BenjaminJimenez, as long as you specify the correct encoding in the Content-Type header when sending the HTML to the browser, it's OK to leave the Unicode accent characters in the HTML. You don't have to replace them with entity references. – Wyzard Dec 08 '13 at 03:44
  • I mean insert into table1(title,html) values ("Postulación Ayudantías", ) and I must convert it to insert into table1(title,html) values ("Postulacion Ayudantias", ). And it's my responsibility because my college teacher says so – Benjamin Jimenez Dec 08 '13 at 03:44