The text comes from a webpage that is already encoded in ISO-8859-1.

First, let me show an example of the problem. Say that one of the pieces of text on the webpage is Mark Helström. When I use Jsoup to parse the page, that piece of text turns into: Mark Helström

Here is an example of the webpage code:

<body>
    <p>Mark Helström</p>
</body>

Here is the code where I parse the webpage:

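    // Imports needed: java.nio.charset.Charset, org.jsoup.Jsoup,
    // org.jsoup.nodes.Document, org.jsoup.nodes.Element
    String url = "http://localhost:8080/translator/test";
    Document doc = Jsoup.connect(url).get();

    // Print the charset currently held in the document's output settings
    System.out.println("charset=" + doc.outputSettings().charset());

    // Set the output charset to ISO-8859-1
    doc.outputSettings().charset(Charset.forName("ISO-8859-1"));

    System.out.println("charset=" + doc.outputSettings().charset());

    // Print the own text of every element in the document
    for (Element code : doc.select("*")) {
        System.out.println("code=" + code.ownText());
    }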

Here is the output generated by the code above:

charset=ISO-8859-1
charset=ISO-8859-1
code=
code=
code=
code=
code=Mark Helström
cod3min3
  • Possible duplicate of [JSoup character encoding issue](http://stackoverflow.com/questions/7703434/jsoup-character-encoding-issue) – Zack Aug 10 '16 at 14:16
  • Not a possible duplicate. I tried that method and it didn't work. – cod3min3 Aug 10 '16 at 16:08
  • Can you show your code where you tried to parse the content as ISO-8859-1? What you have currently is the output charset, not the input. – Zack Aug 10 '16 at 16:21
  • @ZackTeater I updated my question so that it contains your solution. – cod3min3 Aug 11 '16 at 08:12
  • Could you update the sample HTML? It seems to be missing 5 paragraphs. It could be the ownText() method, which only displays the element's own text node. If there is an embedded element, such as one wrapping the text foo, it won't display that element's text. Can you try with text() instead? – Zack Aug 11 '16 at 13:16
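
To illustrate the distinction raised in the comments between the input charset and the output charset, here is a minimal sketch that uses only the Java standard library, not Jsoup. The class name `InputCharsetSketch` and the helper `decode` are hypothetical, and the byte array only simulates what an ISO-8859-1 server might send; whether the problem in the question arises exactly this way is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class InputCharsetSketch {

    // Hypothetical helper: decode raw response bytes with the charset
    // the caller believes the page was written in (the INPUT charset).
    static String decode(byte[] rawBytes, Charset inputCharset) {
        return new String(rawBytes, inputCharset);
    }

    public static void main(String[] args) {
        // Simulate the bytes an ISO-8859-1 server would send for the sample
        // text. \u00F6 is "ö"; in ISO-8859-1 it is the single byte 0xF6.
        String original = "Mark Helstr\u00F6m";
        byte[] rawBytes = original.getBytes(StandardCharsets.ISO_8859_1);

        // Mismatched input charset: 0xF6 is not valid UTF-8, so the decoder
        // substitutes U+FFFD (the replacement character) for it.
        String wrong = decode(rawBytes, StandardCharsets.UTF_8);

        // Matching input charset: the text round-trips intact.
        String right = decode(rawBytes, StandardCharsets.ISO_8859_1);

        System.out.println("round-trips with ISO-8859-1: " + right.equals(original));
        System.out.println("contains U+FFFD with UTF-8:  " + wrong.contains("\uFFFD"));
    }
}
```

Once the bytes have been decoded with the wrong charset, changing the output charset afterwards cannot bring the original character back. In Jsoup, the input charset can be passed explicitly through the `Jsoup.parse(InputStream in, String charsetName, String baseUri)` overload; whether that resolves the issue here depends on the bytes and headers the server actually sends.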

1 Answer

It seems to work fine when I parse this question's own page and select the text containing those characters.

    Document doc = Jsoup
            .connect("http://stackoverflow.com/questions/38875180/jsoup-is-encoding-iso-8859-1-text-thats-on-a-webpage-to-another-encoding")
            .get();

    System.out.println("charset=" + doc.outputSettings().charset());

    doc.outputSettings().charset(Charset.forName("ISO-8859-1"));

    System.out.println("charset=" + doc.outputSettings().charset());

    for (Element code : doc.select(".post-text p code:contains(mark)"))
        System.out.println("code=" + code.ownText());

Console

charset=UTF-8
charset=ISO-8859-1
code=Mark Helström
code=Mark Helström
Zack