I'm trying to parse an HTML file using Jsoup. The HTML contains a special character, the Euro sign (€), that I want to remove. This is the original markup:

<span class="price-value">
    49,99 €
</span>

However, Netbeans shows this when printing that element:

49.99 ?

Therefore, I cannot do this:

price.replace( "€", "" ).replace( ",", "." ).trim();

Neither this:

price.replace( "\\?", "" ).replace( ",", "." ).trim();

What can I do about it?

Dani M

3 Answers

Modified from here:

To match individual characters, you can simply include them in a character class, either as literals or via the \u20AC syntax.

The Unicode code point for the Euro sign is U+20AC, written \u20AC in Java.
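
As a rough sketch of that idea, the snippet below removes the Euro sign through a regex character class written with the \u20AC escape, so the source file itself never has to contain the € character. The class name EuroStrip and the method normalizePrice are made up for illustration; only standard String methods are used.

```java
public class EuroStrip {
    // Hypothetical helper: removes the Euro sign (U+20AC) via a regex
    // character class, turns the decimal comma into a dot, and trims.
    static String normalizePrice(String raw) {
        // "[\\u20AC]" reaches the regex engine as [\u20AC], which it
        // interprets as the single character U+20AC.
        return raw.replaceAll("[\\u20AC]", "").replace(',', '.').trim();
    }

    public static void main(String[] args) {
        // The input is written with an escape so the file stays ASCII-only.
        System.out.println(normalizePrice("  49,99 \u20AC "));
    }
}
```

Because the pattern is spelled with an escape, the result does not depend on the encoding the source file was saved in.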


Note: I'm not sure why it is displayed as a ?, but it might simply be because € is not ASCII and might be missing from the font.

Laurel

Use this:

<span class="price-value">
49,99 &euro;
</span>

&euro; is the HTML entity for the € sign.

P Sharma

Netbeans shows this when printing that element

Almost certainly this is because your NetBeans console hasn't been configured to display Unicode characters, which is why you've been misled. For a solution to that, see: How to change default encoding in NetBeans 8.0

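So, the document is fine, the first replacement (the one using the literal €) would have worked, and there's no need to change anything else. Note that String.replace matches its argument literally rather than as a regular expression.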
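
One way to check this without trusting the console is to print the code points of the string instead of the characters themselves. The sketch below does that with the standard library only; the class name EuroConsoleCheck and the method toCodePoints are hypothetical names, and the second part assumes the console itself can render UTF-8.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class EuroConsoleCheck {
    // Hypothetical helper: renders each code point as U+XXXX so the
    // output is ASCII-only and survives any console encoding.
    static String toCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String price = "49,99 \u20AC";
        // If U+20AC appears here, the string is intact and only the
        // console rendering is at fault.
        System.out.println(toCodePoints(price));
        // Writing through an explicit UTF-8 stream helps only if the
        // console is also set up to decode UTF-8.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(price);
    }
}
```

If the last code point printed is U+20AC, the parsed text is correct and only the display needs fixing.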

Here's a minimal working example of the original document getting parsed correctly, the Euro symbol replaced, and 49.99 returned.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
Document doc = Jsoup.parse("<html><body><span class=\"price-value\">49,99 €</span></body></html>");
Element span = doc.select("span").get(0);
System.out.println( span.text().replace("€", "").replace(",", ".").trim() ); // prints 49.99
Andrew Regan
  • It's weird, because it had always worked until now. I've reinstalled NetBeans and now it works fine. – Dani M Apr 04 '16 at 10:51