23

I am using Jsoup to parse content from http://www.latijnengrieks.com/vertaling.php?id=5368. This is a third-party website that does not specify a proper encoding. I am using the following code to load the data:

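import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Loader {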

    public static void main(String[] args){
        String url = "http://www.latijnengrieks.com/vertaling.php?id=5368";

        Document doc;
        try {

            doc = Jsoup.connect(url).timeout(5000).get();
            Element content = doc.select("div.kader").first();
            Element contenttableElement = content.getElementsByClass("kopje").first().parent().parent();

            String contenttext = content.html();
            String tabletext = contenttableElement.html();

            contenttext = Jsoup.parse(contenttext).text();
            contenttext = contenttext.replace("br2n", "\n");
            tabletext = Jsoup.parse(tabletext.replaceAll("(?i)<br[^>]*>", "br2n")).text();
            tabletext = tabletext.replace("br2n", "\n");

            String text = contenttext.substring(tabletext.length(), contenttext.length());
            System.out.println(text);


        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }


    }    

}

This gives the following output:

Aeneas dwaalt rond in Troje en zoekt Cre?sa. Cre?sa is echter op de vlucht gestorven Plotseling verschijnt er een schim. Het is de schim van Cre?sa. De schim zegt:'De oorlog woedt!' Troje is ingenomen! Cre?sa is gestorven:'Vlucht!' Aeneas vlucht echter niet. Dan spreekt de schim:'Vlucht! Er staat jou een nieuw vaderland en een nieuw koninkrijk te wachten.' Dan pas gehoorzaamt Aeneas en vlucht.

Is there any way to get the original character (ü) back in place of the ? marks in the output?

Hihaatje

4 Answers

52

The charset attribute is missing from the Content-Type header of the HTTP response, so Jsoup has to fall back to a default charset when parsing the HTML (UTF-8 in current Jsoup versions, as far as I can tell, rather than the platform default). Setting Document.OutputSettings#charset() won't help, because it is used for presentation only (on html() and text()), not for parsing the data. In other words, by then it's too late.

You need to read the URL as an InputStream and specify the charset manually in the Jsoup#parse() method.

String url = "http://www.latijnengrieks.com/vertaling.php?id=5368";
Document document = Jsoup.parse(new URL(url).openStream(), "ISO-8859-1", url);
Element paragraph = document.select("div.kader p").first();

for (Node node : paragraph.childNodes()) {
    if (node instanceof TextNode) {
        System.out.println(((TextNode) node).text().trim());
    }
}

This results in:

Aeneas dwaalt rond in Troje en zoekt Creüsa.
Creüsa is echter op de vlucht gestorven
Plotseling verschijnt er een schim.
Het is de schim van Creüsa.
De schim zegt:'De oorlog woedt!'
Troje is ingenomen!
Creüsa is gestorven:'Vlucht!'
Aeneas vlucht echter niet.
Dan spreekt de schim:'Vlucht! Er staat jou een nieuw vaderland en een nieuw koninkrijk te wachten.'
Dan pas gehoorzaamt Aeneas en vlucht.
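
The "too late already" point can be reproduced without Jsoup. The following is a minimal sketch using only the JDK; the class name `DecodeDemo` and its `decode` helper are hypothetical names chosen for illustration, and the byte array is simply "Creüsa" as it would be encoded in ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // "Creüsa" encoded as ISO-8859-1: the ü is the single byte 0xFC.
    public static final byte[] CREUSA_LATIN1 =
            {0x43, 0x72, 0x65, (byte) 0xFC, 0x73, 0x61};

    // Hypothetical helper: decode raw bytes with an explicitly chosen charset.
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Wrong charset: 0xFC is not valid UTF-8, so the decoder substitutes U+FFFD.
        String wrong = decode(CREUSA_LATIN1, StandardCharsets.UTF_8);
        // Right charset: 0xFC maps to U+00FC (ü).
        String right = decode(CREUSA_LATIN1, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 decode contains U+FFFD: " + wrong.contains("\uFFFD"));
        System.out.println("ISO-8859-1 decode is Creüsa:  " + right.equals("Cre\u00FCsa"));
    }
}
```

Once the bytes have been decoded with the wrong charset, the original 0xFC byte is gone from the resulting String, which is why changing the output charset afterwards cannot bring the ü back.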
BalusC
  • 1
    **That's** the answer I'm looking for! Thanks again Balus, and 5+ if I could! – Hovercraft Full Of Eels Oct 10 '11 at 18:39
  • @Hovercraft: you're welcome. By the way, Jonathan has added `Element#textNodes()` for the upcoming Jsoup 1.6.2 which should make the `instanceof` check superfluous. You could just do `for (TextNode node : paragraph.textNodes())`. See also http://stackoverflow.com/questions/7164376/how-to-extract-separate-text-nodes-with-jsoup/7164518#7164518 – BalusC Oct 10 '11 at 18:45
  • 1
    Actually, the presence of a "Content-Type" header with a valid charset wouldn't change Jsoup's behaviour when you call Jsoup.parse(someText). You need to call Jsoup.parse(inputStream, null, baseUrl) or a similar method to have Jsoup detect the charset. – Tristan Nov 17 '15 at 12:28
  • This answer also solves the problem for URLs containing other characters. Great! @BalusC – malajisi Jun 08 '17 at 03:19
  • Is it possible to add parameters this way, similar to Jsoup.connect(url).data(params)? – Reva Junior Aug 24 '17 at 01:00
  • "Jsoup.parse(inputStream, null, baseUrl)" this one is also a good solution. Thanks @Tristan – Md. Sajedul Karim Sep 21 '17 at 18:35
17

Well, I figured out another way to do that. In my case, I had a Jsoup Connection object and wanted to retrieve the HTML response of a post() request to a website that was encoded with "ISO-8859". Because Jsoup's default encoding is UTF-8, the HTML in the response came back with � in place of some letters. I needed to decode it as ISO-8859-15 instead. To do that, I first created the connection:

Connection connectionTest = Jsoup.connect("URL")
        .cookie("cookiereference", "cookievalue")
        .method(Method.POST);

Next, I created a Document from the response to the POST. Since it was not clear how to set the response encoding in Jsoup, I executed the request and took the response body as raw bytes, which leaves the original encoding untouched. I then built a new String from that byte array with the correct charset and parsed it, so the Document is created with the right characters.

Document response = Jsoup.parse(
        new String(connectionTest.execute().bodyAsBytes(), "ISO-8859-15"));

Here is the output of response.html() before and after the change:

Before:

62.09-1-00 - Suporte t�cnico, manuten��o e outros servi�os em tecnologia da informa��o

After:

62.09-1-00 - Suporte técnico, manutenção e outros serviços em tecnologia da informação

hugoeiji
  • You cannot use this as a general method. How will it work if the website you are hitting uses another encoding? Can we make it general, or at least specific: if the website is in ISO-8859, run this code, otherwise run the default (Jsoup.parse(execute.body(), url))? – Asad Rao Apr 05 '20 at 07:55
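
One possible way to avoid hard-coding the charset, sketched with the JDK only: take the charset from the Content-Type header when it names one, and fall back to a charset of your choice otherwise. The class `CharsetPicker` and its methods are hypothetical names for illustration, not part of Jsoup, and the header string is assumed to be something you have already read from the response.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetPicker {
    // Matches e.g. "charset=ISO-8859-1" or charset="utf-8" inside a Content-Type value.
    private static final Pattern CHARSET =
            Pattern.compile("(?i)charset\\s*=\\s*\"?([^\\s;\"]+)");

    // Hypothetical helper: pick the declared charset, or the fallback if none is usable.
    public static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    // Decode the raw response bytes with the charset chosen above.
    public static String decode(byte[] body, String contentType, Charset fallback) {
        return new String(body, fromContentType(contentType, fallback));
    }
}
```

With this sketch, a site that declares its charset is decoded accordingly, and the ISO-8859 case from this answer becomes the fallback argument rather than a fixed constant. The decoded String can then be handed to Jsoup.parse(...) as in the answer above.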
7

The Jsoup documentation states that Jsoup should automatically detect the correct charset when reading in the document, but for some reason, it's not working for me. I then tried to manually set the Document's charset using outputSettings().charset(...):

doc.outputSettings().charset("ISO-8859-1");

But that still didn't work, so perhaps I'm doing it wrong (I'm just learning Jsoup).

One work-around that did work, at least for me, was to read in the web page using a Scanner that had its charset set:

    String charset = "ISO-8859-1";

    URL myUrl = new URL(url);
    // Decode the stream with an explicit charset while reading it
    Scanner urlScanner = new Scanner(myUrl.openStream(), charset);
    StringBuilder sb = new StringBuilder();
    while (urlScanner.hasNextLine()) {
        sb.append(urlScanner.nextLine()).append('\n');
    }
    urlScanner.close();

    doc = Jsoup.parse(sb.toString());

But I'll be following this thread to see if anyone comes up with a better suggestion, one that doesn't need the use of another class to read in the HTML.
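
For comparison, the same decoding can be done without Scanner by wrapping the stream in an InputStreamReader with an explicit charset. This is a minimal sketch using only the JDK; `StreamText` and `readAll` are hypothetical names, and in the snippet above the stream would be `myUrl.openStream()`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class StreamText {
    // Hypothetical helper: read the whole stream as text, decoding with the given charset.
    public static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            // Rethrown unchecked to keep the sketch short; real code may prefer to propagate it.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

Unlike the hasNextLine() loop, this keeps the original line endings as they are, and the result can be passed to Jsoup.parse(String) in the same way.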

Hovercraft Full Of Eels
-1

I used:

public static String charset = "UTF-8";
doc = Jsoup.parse(new URL(theURL).openStream(), charset, theURL);

I also saved the source file as UTF-8.