I use Jsoup to extract the contents of a URL as HTML. But in place of characters like '-' (hyphen) and ' (apostrophe), I get weird symbols. I don't see these symbols when I view the page source.
Below is the code I use:
String url = "http://www.novotreeminds.com/job-details.html#chief";
org.jsoup.nodes.Document document = org.jsoup.Jsoup.connect(url).get();
document = Jsoup.connect(url).timeout(20000)
.method(Connection.Method.GET)
.ignoreContentType(true).execute().parse();
document.outputSettings(new Document.OutputSettings().prettyPrint(false));
System.out.println(document);
In the extracted contents, instead of
Experience: 6 – 10 Years
I see:
Experience: 6 � 10 Years
This happens with apostrophes as well. In some cases I see a square-like symbol instead of the weird symbol above.
Thanks, Akhila
Hi @AHungerArtist,
I have tried the code below (specifying the character encoding used by the page):
File input = new File("/home/Documents/NovoTree Minds.html");
Document doc = Jsoup.parse(input, "iso-8859-1", "");
But I see the same result.
Thanks, Akhila
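
For anyone trying to reproduce the symptom without the page itself, here is a minimal sketch using only the JDK. It rests on an assumption that is not confirmed anywhere above: that the page stores the dash as the single byte 0x96, which is how windows-1252 encodes an en dash. The class name `CharsetDemo` and the helper `decode` are made up for illustration. The sketch decodes the same bytes with three charsets and prints the code point that ends up where the dash should be.

```java
import java.nio.charset.Charset;

public class CharsetDemo {
    // Hypothetical helper: decode raw bytes with the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // ASSUMPTION: "6 <en dash> 10" stored as windows-1252,
        // where the en dash is the single byte 0x96.
        byte[] raw = {'6', ' ', (byte) 0x96, ' ', '1', '0'};

        for (String cs : new String[] {"UTF-8", "ISO-8859-1", "windows-1252"}) {
            String s = decode(raw, cs);
            // Index 2 is the position of the dash in the decoded string.
            System.out.printf("%-12s -> U+%04X%n", cs, (int) s.charAt(2));
        }
    }
}
```

Under that assumption, UTF-8 decoding yields U+FFFD (the replacement character, shown as �), ISO-8859-1 yields U+0096 (an invisible control character that many fonts draw as a square), and windows-1252 yields U+2013 (the en dash). That would match both symptoms described above. If it is indeed the cause, passing "windows-1252" rather than "iso-8859-1" as the charset name to `Jsoup.parse(input, charsetName, "")` may be worth trying; the console's own output encoding can also garble the character even when parsing is correct.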