I'm trying to parse an HTML page using XPath in Java. Here is my code:

        /** Cleaning the html file */
        /** the 'doc' variable is a String containing the whole html file */
        TagNode tagNode = new HtmlCleaner().clean(doc);
        Document doc2 = new DomSerializer( new CleanerProperties() ).createDOM(tagNode);

        XPath xpath = XPathFactory.newInstance().newXPath();

        /** xpath request */
        Object dates_experience = xpath.evaluate("/html/body/div[3]/div/div/div[2]/div/div/div[2]/div[4]/div/div[3]/h4/span[2]", doc2, XPathConstants.NODESET);

        NodeList nodes = (NodeList) dates_experience;
        String s;
        for (int i = 0; i < nodes.getLength(); i++) {
            s = org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4(nodes.item(i).getTextContent());
            System.out.println(s); 
        }

I think I have a problem with StringEscapeUtils or with HtmlCleaner, because in the output I see this:

�

instead of those characters:

é, è, ', à, û, ...etc

For example, I have this input:

décembre 2010 - décembre 2010)
février 2010 - juin 2010)
juillet 2009 - septembre 2009)
juin 2009 - juin 2009)
juillet 2008 - août 2008)

My program produces this output:

d�cembre 2010 - d�cembre 2010)
f�vrier 2010 - juin 2010)
juillet 2009 - septembre 2009)
juin 2009 - juin 2009)
juillet 2008 - ao�t 2008)

Could you help me solve this problem, please?

Thanks.

shanks_roux
  • It seems you don't have to escape or unescape them, as long as you respect the charset of the data. – Marvin Emil Brach Jun 07 '13 at 13:03
  • Is the "doc" string properly loaded, with a valid charset? The doc string must be loaded with the charset of the document, as declared in the HTML file. – Toilal Jun 07 '13 at 13:05
  • yes, I loaded it in UTF-8 as defined in the HTML file – shanks_roux Jun 07 '13 at 13:14
  • So it doesn't have to be (un)escaped. UTF-8 covers all of the characters you've posted. If you look at the HTML source of that page, is there a é or something else? – Marvin Emil Brach Jun 07 '13 at 13:26
  • I tried your solution but, whether I use unescape or not, é û... chars are replaced by "?". So my output is d?cembre, ao?t. Plus, when I do a System.out.println(doc), these chars are also replaced by "?". – shanks_roux Jun 07 '13 at 13:30
  • Then you aren't using the charset the document specifies! 1. Is there any other application, JavaScript file or anything else between the website and your parser? 2. Look at your browser's settings; possibly the document says it's UTF-8 but isn't. Switch the browser's encoding from automatic to manual, set UTF-8, and check whether the characters look as they should. – Marvin Emil Brach Jun 07 '13 at 13:34
  • Actually, I created an HTML file by copying and pasting the page source code into Notepad++. I tried to convert it to UTF-8 in Notepad++ but my output doesn't change :( – shanks_roux Jun 07 '13 at 14:02
  • I also checked out my browser's configurations and I can tell you that the character encoding is UTF-8 and everything is well-displayed. – shanks_roux Jun 07 '13 at 14:05
  • I think the failure is located at HTML-Cleaner (http://htmlcleaner.sourceforge.net/parameters.php). – Marvin Emil Brach Jun 07 '13 at 14:19
  • ERR! Possibly it's just the console that doesn't recognize the output. Have a look at this: http://stackoverflow.com/questions/10143998/cyrillic-in-windows-consolejava-system-out-println and on a Windows system use "Windows-1252" as the output encoding. Alternatively, write your output as UTF-8 to a file and open that in Notepad++. – Marvin Emil Brach Jun 07 '13 at 14:22
  • I have the same problem when writing the output into a file (in UTF-8). :( – shanks_roux Jun 07 '13 at 14:33
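
The two symptoms described in this thread (the "�" character and the "?" character) can each be reproduced with a charset mismatch, which is what the comments are hinting at. The following is only a minimal sketch, not a diagnosis of the asker's program: the class name `CharsetDemo` and the helper `roundTrip` are made up for illustration, and it assumes the text was stored as ISO-8859-1 in one case and pushed through an ASCII-only output in the other.

```java
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; none of these names come from the question.
final class CharsetDemo {

    /** Encodes text with one charset, then decodes the resulting bytes with another. */
    static String roundTrip(String text, Charset writtenAs, Charset readAs) {
        byte[] bytes = text.getBytes(writtenAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) throws Exception {
        // "décembre 2010", written with an escape so the source file encoding does not matter
        String original = "d\u00e9cembre 2010";

        // Bytes written as ISO-8859-1 but read as UTF-8: the lone 0xE9 byte is
        // malformed UTF-8, so the decoder substitutes U+FFFD (shown as "�").
        String wrongDecode = roundTrip(original,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);

        // Encoding to a charset that has no 'é': the encoder substitutes '?'.
        String lossyEncode = roundTrip(original,
                StandardCharsets.US_ASCII, StandardCharsets.US_ASCII);

        // Same charset on both sides: the text survives unchanged.
        String correct = roundTrip(original,
                StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        // Print through an explicitly UTF-8 stream instead of the platform default.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("wrong decode : " + wrongDecode);
        out.println("lossy encode : " + lossyEncode);
        out.println("correct      : " + correct);
    }
}
```

If the first case matches what you see, the place to look is wherever bytes become a `String` (reading the file or the HTTP response); if the second matches, look at wherever the `String` becomes bytes again (console or file writer).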

1 Answer

I suspect you should *un*escape, not escape them: StringEscapeUtils.unescapeHtml4(String)

qqilihq