0

I have a string like `4,0 &#151; 10,0` and I need to decode it to `4,0 — 10,0`. The code can be looked up at https://www.codetable.net/decimal/151

I tried Apache's StringEscapeUtils.unescapeJava without any luck.

Lii
FrancMo
  • 6
    Check the documentation: `unescapeJava`, as the name says, handles *Java* escape sequences. What you have is an HTML entity. Try using `unescapeHtml4`. – Konrad Rudolph Sep 08 '22 at 07:46
  • 4
    Maybe this will help? [What is the difference between EM Dash #151; and #8212;?](https://stackoverflow.com/questions/631406/what-is-the-difference-between-em-dash-151-and-8212) – Abra Sep 08 '22 at 07:50

1 Answer

2

It is a numeric character reference (a numeric entity), common in HTML, XML, and their ancestor, SGML.

Try Apache's `StringEscapeUtils.unescapeHtml4` (or `unescapeHtml3`). These also take care of named entities like `&mdash;`.

Or do it yourself:

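// Needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
// Matches decimal numeric entities such as &#151;
Pattern entityPattern = Pattern.compile("&#(\\d+);");
String s = "4,0 &#151; 10,0";
// quoteReplacement keeps a decoded '$' or '\' from being read as replacement syntax
s = entityPattern.matcher(s).replaceAll(mr
        -> Matcher.quoteReplacement(
                new String(new int[] {Integer.parseInt(mr.group(1))}, 0, 1)));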

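This creates a string containing the single Unicode code point 151 (U+0097). That is a C1 control character, not the em dash (U+2014); browsers show `&#151;` as an em dash only because they interpret the codes 128 to 159 as Windows-1252. For hexadecimal numeric entities: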

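// Needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
// Matches hexadecimal numeric entities such as &#x97; (also &#X97; and upper-case digits)
Pattern entityPattern = Pattern.compile("&#x([\\da-f]+);",
        Pattern.CASE_INSENSITIVE);
String s = "4,0 &#x97; 10,0";
// quoteReplacement keeps a decoded '$' or '\' from being read as replacement syntax
s = entityPattern.matcher(s).replaceAll(mr
        -> Matcher.quoteReplacement(
                new String(new int[] {Integer.parseInt(mr.group(1), 16)}, 0, 1)));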
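
If the goal is the visible em dash rather than the raw code point, the two patterns can be combined and the codes 128 to 159 decoded as Windows-1252 bytes, which is roughly what browsers do. The following is only a sketch under that assumption: the class name `EntityDecoder` and the method `decode` are made up for illustration, it needs Java 9 or later for `replaceAll` with a function, and it still requires the trailing semicolon.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class EntityDecoder {
    // Group 1 captures hexadecimal digits (&#x97;), group 2 decimal digits (&#151;).
    private static final Pattern NUMERIC_ENTITY =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|(\\d+));");

    private static final Charset CP1252 = Charset.forName("windows-1252");

    static String decode(String input) {
        return NUMERIC_ENTITY.matcher(input).replaceAll(mr -> {
            int codePoint;
            try {
                codePoint = mr.group(1) != null
                        ? Integer.parseInt(mr.group(1), 16)
                        : Integer.parseInt(mr.group(2));
            } catch (NumberFormatException e) {
                // Number does not fit into an int: keep the reference unchanged.
                return Matcher.quoteReplacement(mr.group());
            }
            String decoded;
            if (codePoint >= 0x80 && codePoint <= 0x9F) {
                // Assumption: C1 values are meant as Windows-1252 bytes (151 -> U+2014).
                decoded = new String(new byte[] {(byte) codePoint}, CP1252);
            } else if (Character.isValidCodePoint(codePoint)) {
                decoded = new String(Character.toChars(codePoint));
            } else {
                // Outside the Unicode range: keep the reference unchanged.
                decoded = mr.group();
            }
            // Escape '$' and '\' so they are not treated as replacement syntax.
            return Matcher.quoteReplacement(decoded);
        });
    }
}
```

With this sketch, `EntityDecoder.decode("4,0 &#151; 10,0")` and `EntityDecoder.decode("4,0 &#x97; 10,0")` should both give `4,0 — 10,0`. If the literal code point 151 is wanted instead, drop the Windows-1252 branch.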

If you got this string from an HTML form into which the user entered or pasted special characters, the form is probably missing the `accept-charset` attribute:

<form action="..." accept-charset="UTF-8">

Without it, the browser converts characters that the form's encoding cannot represent into numeric entities.

This assumes that the web server already uses UTF-8 for its pages.

Joop Eggen
  • 3
    It’s `Pattern.compile`, not `Pattern.compiler`. And there is no reason to use `[x]` to match an `x`, as you can just use `x` for that. Further, you might want to consume the semicolon if there is one. By the way, you can also use `Integer.parseInt(s, mr.start(1), mr.end(1), 16)` to avoid substring creation. – Holger Sep 08 '22 at 08:55
  • @Holger I was in too much of a hurry, no more Stack Overflow for me today. I first wanted to use `[xX]` but then switched to case-insensitive matching for the hex letters. The semicolon is required (though I have sometimes seen it missing). – Joop Eggen Sep 08 '22 at 09:21
  • 2
    Afaik, HTML allows omitting the semicolon if the entity is not followed by a word character, but the processor obviously has to consume it if present. – Holger Sep 08 '22 at 11:04