Java - Problems converting HTML special characters

Question

I'm trying to parse an HTML page using XPath in Java. Here is my code:

        /** Cleaning the html file */
        /** the 'doc' variable is a String containing the whole html file */
        TagNode tagNode = new HtmlCleaner().clean(doc);
        Document doc2 = new DomSerializer( new CleanerProperties() ).createDOM(tagNode);

        XPath xpath = XPathFactory.newInstance().newXPath();

        /** xpath request */
        Object dates_experience = xpath.evaluate("/html/body/div[3]/div/div/div[2]/div/div/div[2]/div[4]/div/div[3]/h4/span[2]", doc2, XPathConstants.NODESET);

        NodeList nodes = (NodeList) dates_experience;
        String s;
        for (int i = 0; i < nodes.getLength(); i++) {
            s = org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4(nodes.item(i).getTextContent());
            System.out.println(s); 
        }

I think I have a problem with StringEscapeUtils or with HtmlCleaner, because in the output I see this:

ï¿½

instead of these characters:

é, è, ', à, û, etc.

For example, I have this input:

décembre 2010 - décembre 2010)
février 2010 - juin 2010)
juillet 2009 - septembre 2009)
juin 2009 - juin 2009)
juillet 2008 - août 2008)

My program produces this output:

dï¿½cembre 2010 - dï¿½cembre 2010)
fï¿½vrier 2010 - juin 2010)
juillet 2009 - septembre 2009)
juin 2009 - juin 2009)
juillet 2008 - aoï¿½t 2008)

Could you help me solve this problem, please?

Thanks.

Seems like you don't have to escape or unescape them, as long as you respect the charset of the data. — Marvin Emil Brach, Jun 07 '13 at 13:03
Is the "doc" string properly loaded, with a valid charset? The doc string must be loaded with the charset of the document, as declared in the HTML file's meta tag. — Toilal, Jun 07 '13 at 13:05
So it doesn't have to be (un)escaped. UTF-8 covers all of the characters you've posted. If you look at the HTML source of that page, is there an é or something else? — Marvin Emil Brach, Jun 07 '13 at 13:26
I tried your solution but, whether I use unescape or not, the é, û... chars are replaced by "?". So my output is d?cembre, ao?t. Plus, when I do a System.out.println(doc), these chars are also replaced by "?". — shanks_roux, Jun 07 '13 at 13:30
Then you aren't using the charset the document specifies! 1. Is there any other application, JavaScript file or anything else between the website and your parser? 2. Look at your browser's settings; possibly the document says it's UTF-8 but isn't. Switch the encoding used by the browser from automatic to manual, set UTF-8 and check whether the characters look as they should. — Marvin Emil Brach, Jun 07 '13 at 13:34
Actually, I created an HTML file by copying/pasting the page source code into Notepad++. I tried to set the encoding to UTF-8 in Notepad++, but my output doesn't change :( — shanks_roux, Jun 07 '13 at 14:02
I also checked my browser's configuration, and I can tell you that the character encoding is UTF-8 and everything is displayed correctly. — shanks_roux, Jun 07 '13 at 14:05
I think the failure is located in HtmlCleaner (http://htmlcleaner.sourceforge.net/parameters.php). — Marvin Emil Brach, Jun 07 '13 at 14:19
ERR! Possibly it's just the console that doesn't recognize the output. Have a look at this: http://stackoverflow.com/questions/10143998/cyrillic-in-windows-consolejava-system-out-println and, on a Windows system, use "Windows-1252" as the output encoding. Alternatively, write your output as UTF-8 to a file and open that in Notepad++ — Marvin Emil Brach, Jun 07 '13 at 14:22
I have the same problem when writing the output into a file (in UTF-8). :( — shanks_roux, Jun 07 '13 at 14:33
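
The symptoms in this thread are consistent with a decoding mismatch rather than an escaping problem. The sketch below is a hedged reproduction, not a confirmed diagnosis: it assumes the page bytes are ISO-8859-1 (the thread never establishes the real encoding) and that they were decoded as UTF-8 somewhere before reaching HtmlCleaner. The class name CharsetSketch and its two helper methods are made up for illustration; only standard JDK calls are used.

```java
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative, not from the original post.
class CharsetSketch {

    // Decode raw bytes with an explicitly chosen charset
    // instead of relying on the platform default.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    // Simulate a viewer that displays UTF-8 encoded output as ISO-8859-1.
    static String viewUtf8AsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the page is stored as ISO-8859-1, so "é" (U+00E9)
        // is the single byte 0xE9. Unicode escapes keep this source ASCII-only.
        byte[] raw = "d\u00e9cembre".getBytes(StandardCharsets.ISO_8859_1);

        // Decoding with the wrong charset: 0xE9 is not valid UTF-8 here,
        // so the decoder substitutes U+FFFD (the replacement character).
        String wrong = decode(raw, StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were actually written in.
        String right = decode(raw, StandardCharsets.ISO_8859_1);

        // Print through an explicit UTF-8 stream so the console encoding
        // is a deliberate choice rather than the platform default.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(right);
        out.println(wrong);

        // U+FFFD is the bytes EF BF BD in UTF-8; shown as ISO-8859-1
        // those three bytes become the sequence seen in the question.
        out.println(viewUtf8AsLatin1(wrong));
    }
}
```

If the real page does turn out to be ISO-8859-1 or Windows-1252, decoding the bytes with that charset before building the "doc" string should avoid the replacement character. The "?" seen on the console would also fit this picture, since Java writes "?" when the output charset cannot encode U+FFFD.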

score 1 · Answer 1 · answered Jun 07 '13 at 12:55

I suspect you should *un*escape, not escape them: StringEscapeUtils.unescapeHtml4(String)

qqilihq

Can you post some sample input/output of your data? – qqilihq Jun 07 '13 at 12:57
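
To make the scope of unescaping concrete: it only turns entity references such as &eacute; into characters, and it cannot restore a character that was already lost while decoding. The sketch below uses a hypothetical stand-in, unescapeFew, with a three-entry table in place of Commons Lang's StringEscapeUtils.unescapeHtml4, purely so that it runs without extra dependencies. It is not that library's implementation, and the class and method names are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for an HTML unescaper; not the Commons Lang code.
class UnescapeSketch {

    // Deliberately tiny entity table; the real HTML 4 list is far longer.
    static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&eacute;", "\u00e9");
        ENTITIES.put("&egrave;", "\u00e8");
        ENTITIES.put("&ucirc;", "\u00fb");
    }

    // Replace each known entity reference with its character.
    static String unescapeFew(String text) {
        String result = text;
        for (Map.Entry<String, String> entry : ENTITIES.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // An entity reference is converted to the real character.
        System.out.println(unescapeFew("d&eacute;cembre").equals("d\u00e9cembre"));

        // Text that already contains the character is left untouched.
        System.out.println(unescapeFew("d\u00e9cembre").equals("d\u00e9cembre"));

        // A replacement character (U+FFFD) stays a replacement character:
        // unescaping cannot undo a wrong charset decode.
        System.out.println(unescapeFew("d\uFFFDcembre").equals("d\uFFFDcembre"));
    }
}
```

All three lines print true, which matches what the asker reported in the comments: the output is the same with or without the unescape call.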
