Replace HTML codes with equivalent characters in Java

Question

I'm currently working on replacing HTML codes with their equivalent characters in Java. I need to convert the codes below to characters.

&#x00E8; - è
&#xAE;   - ®
&#x0026; - &
&#x00F1; - ñ
&#x26;   - &

I tried using the regex pattern

(&#x)([\\d|\\w]*)([\\d|\\w]*)([\\d|\\w]*)([\\d|\\w]*)(;)

When I debug, matcher.find() returns true, but control skips the loop where I have written the conversion code. I don't know what is happening there.

Also, is there any way to optimize this regex?

Any help is appreciated.

Exception

java.lang.NumberFormatException: For input string: "x26"
      at java.lang.NumberFormatException.forInputString(Unknown Source)
      at java.lang.Integer.parseInt(Unknown Source)
      at java.lang.Integer.parseInt(Unknown Source)
      at org.apache.commons.lang.Entities.unescape(Entities.java:683)
      at org.apache.commons.lang.StringEscapeUtils.unescapeHtml(StringEscapeUtils.java:483)
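
The trace suggests that the text between `&#` and `;` reached `Integer.parseInt` with the leading `x` still attached, which a base-10 parse rejects. A minimal sketch reproducing this in isolation follows; the class `HexReferenceParsing` and the helper `parseReferenceBody` are hypothetical names chosen for illustration and are not taken from Commons Lang.

```java
public class HexReferenceParsing {

    // Parses the body of a numeric reference, i.e. the text between "&#" and ";".
    // Hypothetical helper: a leading x/X marks a hexadecimal value.
    public static int parseReferenceBody(String body) {
        if (body.startsWith("x") || body.startsWith("X")) {
            return Integer.parseInt(body.substring(1), 16);
        }
        return Integer.parseInt(body, 10);
    }

    public static void main(String[] args) {
        try {
            // Base-10 parse of the raw body fails, as in the trace.
            Integer.parseInt("x26");
        } catch (NumberFormatException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
        // Stripping the prefix and parsing with radix 16 yields code point 38, the ampersand.
        System.out.println((char) parseReferenceBody("x26"));
    }
}
```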

It is already answered :). [Recommended method for escaping HTML in Java](http://stackoverflow.com/questions/1265282/recommended-method-for-escaping-html-in-java) — Subhrajyoti Majumder, Feb 21 '13 at 09:37

score 36 · Accepted Answer · edited Jan 06 '15 at 10:51

Also, is there any way to optimize this regex?

Yes: don't use regex for this task. Use StringEscapeUtils from Apache Commons Lang:

import org.apache.commons.lang.StringEscapeUtils;
...
String withCharacters = StringEscapeUtils.unescapeHtml(yourString);

JavaDoc says:

Unescapes a string containing entity escapes to a string containing the actual Unicode characters corresponding to the escapes. Supports HTML 4.0 entities.

For example, the string "&lt;Fran&ccedil;ais&gt;" will become "<Français>"

If an entity is unrecognized, it is left alone, and inserted verbatim into the result string. e.g. "&gt;&zzzz;x" will become ">&zzzz;x".
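
The library call remains the simpler choice. Purely to illustrate what the pattern in the question was aiming at, here is a minimal JDK-only sketch that is not part of the original answer. The class `NumericEntityDecoder` and its `decode` method are made-up names, and the sketch handles only numeric references (hex and decimal), not named entities such as `&amp;`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericEntityDecoder {

    // Group 1: hex digits of references like &#x00E8;  Group 2: decimal digits of references like &#232;
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9A-Fa-f]{1,6})|([0-9]{1,7}));");

    public static String decode(String input) {
        Matcher m = NUMERIC_REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            // References outside the Unicode range are left as they were.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            // quoteReplacement keeps '$' and '\' in the decoded text literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("A &#x26; B, &#x00E8;, &#xAE;, &#x00F1;"));
    }
}
```

A single capturing group with a quantifier replaces the four repeated `([\\d|\\w]*)` groups from the question, and restricting it to hex digits avoids matching arbitrary word characters.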

edited Jan 06 '15 at 10:51

aspiring_sarge

answered Feb 21 '13 at 09:34

jlordo

Internally it loops over the passed string and uses a double-sized StringBuffer to store the result. Possibly a pre-compiled, optimized regex would give you the desired result with better performance. What do you think? :) – Subhrajyoti Majumder Feb 21 '13 at 09:45

@Quoi: I would always use the solution I posted, unless profiling would show that this is a memory or runtime bottleneck, but that was never the case so far and I'm pretty sure never will be. – jlordo Feb 21 '13 at 10:05
I'm getting a NumberFormatException for the string 'A &#x26; B' – Raja Asthana Feb 21 '13 at 10:27

I think only an entity like '&amp;' will be replaced with '&', not the hex value '&#x26;'. Is that right? – Raja Asthana Feb 21 '13 at 10:29
@RajaAsthana: There's no `NumberFormatException` when I run `String html = "'A &#x26; B'"; String noHtml = unescapeHtml(html);` Your error must be elsewhere. – jlordo Feb 21 '13 at 10:29

@RajaAsthana: For input `"'A &#x26; B'"` you'll get the output `"'A & B'"`. – jlordo Feb 21 '13 at 10:32
Updated the question with the log. Kindly help – Raja Asthana Feb 21 '13 at 11:20
@RajaAsthana: Can you determine which line number in your file/string is causing this problem and post that line, too? – jlordo Feb 21 '13 at 11:29
@RajaAsthana: What version of commons lang are you using? According to [this](https://issues.apache.org/jira/browse/LANG-118) site, this error was fixed in versions 2.1 and newer. – jlordo Feb 21 '13 at 11:32
What to use on Android (Java)? – c0dehunter May 11 '19 at 19:58
By the way, how do we do this in Spring Boot? This is tagged as deprecated in Spring Boot. Thanks. – Artanis Zeratul Aug 05 '19 at 05:47

score 4 · Answer 2 · answered Jun 25 '16 at 19:03

Another existing utility method is spring-web's org.springframework.web.util.HtmlUtils.htmlUnescape.

Example usage in a self-contained Groovy script:

@Grapes(
    @Grab(group='org.springframework', module='spring-web', version='4.3.0.RELEASE')
)
import org.springframework.web.util.HtmlUtils

println HtmlUtils.htmlUnescape("La &#xE9;lite del tenis no teme al zika y jugar&#xE1; en R&#xED;o")

answered Jun 25 '16 at 19:03

Michal M

This answer is more suitable for Spring Boot :) – Artanis Zeratul Aug 05 '19 at 21:31
