18

I'm currently working on converting HTML character codes to their equivalent characters in Java. I need to convert the codes below to characters.

&#xE8; - è
&#xAE; - ®
&#x26; - &
&#xF1; - ñ
&amp; - &

I tried using the regex pattern

(&#x)([\\d|\\w]*)([\\d|\\w]*)([\\d|\\w]*)([\\d|\\w]*)(;)

When I debug, matcher.find() returns true, but control skips the loop where I have written the conversion code. I don't know what is happening there.

Also, is there any way to optimize this regex?

Any help is appreciated.

Exception

java.lang.NumberFormatException: For input string: "x26"
      at java.lang.NumberFormatException.forInputString(Unknown Source)
      at java.lang.Integer.parseInt(Unknown Source)
      at java.lang.Integer.parseInt(Unknown Source)
      at org.apache.commons.lang.Entities.unescape(Entities.java:683)
      at org.apache.commons.lang.StringEscapeUtils.unescapeHtml(StringEscapeUtils.java:483)
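The trace shows `Integer.parseInt` being handed the string "x26", that is, the body of the reference `&#x26;` with the hex marker still attached. The following is a minimal sketch of how such a body can be parsed, using only the standard library. The class and method names (`HexEntityParse`, `parseReferenceBody`) are hypothetical and are not part of Commons Lang.

```java
final class HexEntityParse {
    // Hypothetical helper: parses the text between "&#" and ";".
    // A leading 'x' or 'X' marks a hexadecimal reference; otherwise it is decimal.
    static int parseReferenceBody(String body) {
        if (body.startsWith("x") || body.startsWith("X")) {
            return Integer.parseInt(body.substring(1), 16);
        }
        return Integer.parseInt(body, 10);
    }

    public static void main(String[] args) {
        try {
            // Decimal parse of "x26", the same input the stack trace reports.
            Integer.parseInt("x26");
        } catch (NumberFormatException e) {
            System.out.println("Failed as in the trace: " + e.getMessage());
        }
        // Both forms resolve to code point 38, which is '&'.
        System.out.println((char) parseReferenceBody("x26"));
        System.out.println((char) parseReferenceBody("38"));
    }
}
```

The point is only that the `x` has to be stripped and radix 16 used before parsing; a decimal parse of "x26" cannot succeed.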
Tomalak
Raja Asthana
  • It is already answered :). [Recommended method for escaping HTML in Java](http://stackoverflow.com/questions/1265282/recommended-method-for-escaping-html-in-java) – Subhrajyoti Majumder Feb 21 '13 at 09:37

2 Answers

36

Also, is there any way to optimize this regex?

Yes: don't use regex for this task. Use StringEscapeUtils from Apache Commons Lang:

import org.apache.commons.lang.StringEscapeUtils;
...
String withCharacters = StringEscapeUtils.unescapeHtml(yourString);

JavaDoc says:

Unescapes a string containing entity escapes to a string containing the actual Unicode characters corresponding to the escapes. Supports HTML 4.0 entities.

For example, the string "&lt;Fran&ccedil;ais&gt;" will become "<Français>"

If an entity is unrecognized, it is left alone, and inserted verbatim into the result string. e.g. "&gt;&zzzz;x" will become ">&zzzz;x".
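For cases where adding a library is not an option, here is a minimal sketch of the regex route the question asks about, restricted to numeric references (`&#x...;` and `&#...;`). It uses only `java.util.regex`. The class name `NumericEntityDecoder` is hypothetical, and named entities such as `&amp;` are deliberately left untouched, so this is not a replacement for `unescapeHtml`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumericEntityDecoder {
    // Group 1 captures hex digits after "&#x"; group 2 captures decimal digits after "&#".
    // Compiled once and reused across calls.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decode(String input) {
        Matcher m = NUMERIC_REF.matcher(input);
        // StringBuffer is used because appendReplacement accepts it on every Java version.
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            // References outside the Unicode range are copied through unchanged.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            // quoteReplacement keeps '$' and '\' in the decoded text literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("A &#x26; B, caf&#xE9;, &#174;, &amp;"));
    }
}
```

Note that every call to `find()` advances the matcher. If `find()` is evaluated once before the loop (in an `if`, or in a debugger watch expression), a single match is already consumed by the time `while (m.find())` runs. That is one possible cause of the skipped loop described in the question, though the loop code is not shown there, so this is only a guess.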

aspiring_sarge
jlordo
  • Internally it loops over the passed string and uses a double-sized StringBuffer to store the result. Possibly a pre-compiled, optimized regex would give you the desired result with better performance. What do you think? :) – Subhrajyoti Majumder Feb 21 '13 at 09:45
  • @Quoi: I would always use the solution I posted, unless profiling showed that it is a memory or runtime bottleneck, but that has never been the case so far and I'm pretty sure it never will be. – jlordo Feb 21 '13 at 10:05
  • I'm getting NumberFormatException for string 'A &#x26; B' – Raja Asthana Feb 21 '13 at 10:27
  • I think only entities like '&amp;' will be replaced with '&', not the hex value '&#x26;'. Is that right? – Raja Asthana Feb 21 '13 at 10:29
  • @RajaAsthana: There's no `NumberFormatException` when I run `String html = "'A &#x26; B'"; String noHtml = unescapeHtml(html);` Your error must be elsewhere. – jlordo Feb 21 '13 at 10:29
  • @RajaAsthana: For input `"'A &#x26; B'"` you'll get the output `"'A & B'"`. – jlordo Feb 21 '13 at 10:32
  • Updated the question with the log. Kindly help – Raja Asthana Feb 21 '13 at 11:20
  • @RajaAsthana: Can you determine which line number in your file/string is causing this problem and post that line, too? – jlordo Feb 21 '13 at 11:29
  • @RajaAsthana: What version of commons lang are you using? According to [this](https://issues.apache.org/jira/browse/LANG-118) site, this error was fixed in versions 2.1 and newer. – jlordo Feb 21 '13 at 11:32
  • What to use on Android (Java)? – c0dehunter May 11 '19 at 19:58
  • by the way, how do we do this in springboot? this is tagged as deprecated in spring boot. thanks. – Artanis Zeratul Aug 05 '19 at 05:47
4

Another option among the existing utility methods is spring-web's org.springframework.web.util.HtmlUtils.htmlUnescape.

Example usage in a self-contained Groovy script:

@Grapes(
    @Grab(group='org.springframework', module='spring-web', version='4.3.0.RELEASE')
)
import org.springframework.web.util.HtmlUtils

println HtmlUtils.htmlUnescape("La &#xE9;lite del tenis no teme al zika y jugar&#xE1; en R&#xED;o")
Michal M