
I'm working with an HTML document that contains the word "Español". In the source code, however, it's written as "Espa&# 241;ol" (space added so your browser doesn't convert it automatically).

If I do something like this, the "Español" is NOT found:

        if (source.contains("Espa&#241;ol"))
            System.out.println("Found it");

If I do this, the word IS found:

        if (source.contains("Español"))
            System.out.println("Found it");

Can anyone provide insight into what's going on?

Andrio
  • Either `"Espa\u00F1ol"` or nowadays use UTF-8 for editing and compiling, and type `"Español"`. By the way decimal 241 = hexadecimal 0xF1 – Joop Eggen Apr 07 '15 at 12:52

3 Answers


The top piece of code uses the HTML character reference for the ñ character; the bottom piece does not. The .contains() method searches for the exact character sequence it is given, so the top piece of code looks for the literal text "Espa&#241;ol". That text cannot be found because it is not in the String source.
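To make the exact-match behaviour concrete, here is a minimal sketch. The class name `EscapeDemo` and its two constants are made up for illustration, and it assumes the String being searched already holds the decoded character. The ñ is written as `\u00F1` so the example does not depend on the encoding of the Java source file.

```java
public class EscapeDemo {
    // The word as it appears in the HTML markup (a numeric character reference).
    public static final String RAW = "Espa&#241;ol";
    // The word after the reference has been decoded; \u00F1 is the ñ character.
    public static final String DECODED = "Espa\u00F1ol";

    public static void main(String[] args) {
        // Decimal 241 is hexadecimal 0xF1, the code point of ñ.
        System.out.println((int) DECODED.charAt(4));   // prints 241

        // The two spellings are different character sequences,
        // so neither one contains the other.
        System.out.println(DECODED.contains(RAW));     // prints false
        System.out.println(RAW.contains(DECODED));     // prints false
        System.out.println(DECODED.length() + " vs " + RAW.length()); // prints 7 vs 12
    }
}
```

Whichever form the String actually holds is the only form contains() can match.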

holtc
  • Thanks for the input, however it's the opposite that is happening. The source contains the HTML encoding, and yet searching for "Espa&#241;ol" does not work, while searching for "Español" does. – Andrio Apr 07 '15 at 13:58
  • When Java converts the document text to a String, it probably decodes that character reference into the actual character rather than keeping the encoded form – holtc Apr 07 '15 at 14:00

After getting text from HTML, you need to convert HTML-escaped characters (such as &#241;) into Unicode characters (ñ) first. A good approach is to use the Apache Commons Lang library.

In your case:

input = StringEscapeUtils.unescapeHtml4(input);

will perform HTML->Unicode transformation.
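If adding the library is not an option, the same idea can be sketched with the standard library alone. The class `NumericEntityDecoder` and the method `decodeNumericEntities` below are hypothetical names, not part of Commons Lang, and the sketch only handles numeric references such as `&#241;` or `&#xF1;`. Named entities such as `&ntilde;` are left untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericEntityDecoder {
    // Matches hexadecimal (&#xF1;) and decimal (&#241;) numeric character references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    // Hypothetical helper: replaces each numeric reference with the character it names.
    public static String decodeNumericEntities(String text) {
        Matcher m = NUMERIC_REF.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            if (Character.isValidCodePoint(codePoint)) {
                out.appendCodePoint(codePoint);
            } else {
                out.append(m.group()); // out of range: keep the original text
            }
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String source = "<p>Idioma: Espa&#241;ol</p>";
        String decoded = decodeNumericEntities(source);
        System.out.println(source.contains("Espa\u00F1ol"));  // prints false
        System.out.println(decoded.contains("Espa\u00F1ol")); // prints true
    }
}
```

For real documents the library call is the safer choice, since it also covers the named entities this sketch ignores.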

Alex Salauyou

You will need to un-escape the characters before checking.

Quoting Kevin Hakanson's answer from this question.

You can use the Apache Commons StringEscapeUtils.unescapeHtml4() for this.

So in your case, provided that you've added the Apache Commons Lang library, the following code snippet should work as expected:

if (source.contains(StringEscapeUtils.unescapeHtml4("Espa&#241;ol")))
    System.out.println("Found it");
Ceiling Gecko