
I use JSoup to fetch the contents of a URL as HTML. But in place of characters like '-' (hyphen) and ' (apostrophe), I get weird symbols. I don't see these symbols when I view the page source in the browser.

Below is the code I use:

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String url = "http://www.novotreeminds.com/job-details.html#chief";

// Fetch and parse the page in a single request.
Document document = Jsoup.connect(url)
        .timeout(20000)
        .method(Connection.Method.GET)
        .ignoreContentType(true)
        .execute()
        .parse();

// Print the HTML without pretty-printing.
document.outputSettings(new Document.OutputSettings().prettyPrint(false));
System.out.println(document);

In the extracted contents, instead of

Experience: 6 – 10 Years

I see:

Experience: 6 � 10 Years

This happens with apostrophes as well. In some places I also see a square-like symbol instead of the symbol shown above.

Thanks, Akhila

Hi @AHungerArtist,

I have tried the code below, specifying the character encoding used by the page:

File input = new File("/home/Documents/NovoTree Minds.html");
Document doc = Jsoup.parse(input, "iso-8859-1", "");

But I see the same result.

Thanks, Akhila

  • I don't really have an answer for you but I'm sure it's something to do with a difference in character encoding used for the source and your parser. You might see what the default character encoding is in JSoup and then try to set a different one when parsing the page. – AHungerArtist Apr 09 '17 at 18:53
  • Possible duplicate of [How do I make eclipse print out weird characters in unicode?](http://stackoverflow.com/questions/6233775/how-do-i-make-eclipse-print-out-weird-characters-in-unicode) – Lyubomyr Shaydariv Apr 09 '17 at 19:48
  • This solution seems to work: http://stackoverflow.com/questions/7714879/strange-encoding-behaviour-with-jsoup – Tim Apr 10 '17 at 08:21
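
To illustrate the encoding mismatch the comments point to, here is a minimal sketch using only the JDK. It assumes (this is not confirmed by the question) that the server sends the dash as the single byte 0x96, which is how windows-1252 encodes an en dash. The class name `EncodingDemo` and the helper `decode` are made up for the illustration. The same bytes are decoded with three charsets to show which ones produce a replacement character or an unprintable control character.

```java
import java.nio.charset.Charset;

public class EncodingDemo {
    // Decode raw bytes using the named charset (hypothetical helper for this sketch).
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // ASSUMPTION: "6", space, en dash, space, "10" as a windows-1252 page
        // would send it, with the en dash as the single byte 0x96.
        byte[] raw = {0x36, 0x20, (byte) 0x96, 0x20, 0x31, 0x30};

        for (String cs : new String[] {"UTF-8", "ISO-8859-1", "windows-1252"}) {
            String text = decode(raw, cs);
            // Print the code point that the byte 0x96 was decoded to.
            System.out.printf("%-12s -> U+%04X%n", cs, (int) text.charAt(2));
        }
        // UTF-8        -> U+FFFD  (replacement character, shown as a question-mark diamond)
        // ISO-8859-1   -> U+0096  (C1 control character, often drawn as a square)
        // windows-1252 -> U+2013  (en dash)
    }
}
```

If the page really is windows-1252, this would be consistent with both symptoms described above, and passing "windows-1252" instead of "iso-8859-1" as the charset argument to `Jsoup.parse(input, charsetName, baseUri)` may be worth trying. The console also has to be able to display Unicode, which is what the linked Eclipse question covers.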
