The text comes from a webpage and is already in ISO-8859-1.
First, let me show you an example of the problem. Let us say that this is one of the pieces of text from the webpage: Mark Helström. When I use Jsoup to parse the page, that piece of text then turns into: Mark Helström
Here is an example of the webpage code:
<body>
<p>Mark Helström</p>
</body>
Here is the code where I parse the webpage:
// Fetch and parse the page
String url = "http://localhost:8080/translator/test";
Document doc = Jsoup.connect(url).get();
// Print the document's output charset before and after setting it
System.out.println("charset=" + doc.outputSettings().charset());
doc.outputSettings().charset(Charset.forName("ISO-8859-1"));
System.out.println("charset=" + doc.outputSettings().charset());
// Print the own text of every element in the document
for (Element code : doc.select("*")) {
    System.out.println("code=" + code.ownText());
}
Here is the output generated by the code above:
charset=ISO-8859-1
charset=ISO-8859-1
code=
code=
code=
code=
code=Mark Helström
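For illustration, here is a minimal standard-library sketch of how this kind of garbling can arise. It assumes the page bytes are ISO-8859-1 but end up decoded as UTF-8; that is only an assumption, since nothing above confirms which charset was actually used when the page was read. The class name `CharsetDemo` and the helper `decode` are hypothetical names made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Hypothetical helper: decode raw bytes using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // \u00F6 is the letter o with diaeresis, written as an escape so the
        // example does not depend on the source file's own encoding.
        String original = "Mark Helstr\u00F6m";

        // Simulate the bytes a server would send for an ISO-8859-1 page.
        byte[] pageBytes = original.getBytes(StandardCharsets.ISO_8859_1);

        // Decoding with the wrong charset (assumed here: UTF-8) damages the name.
        String wrong = decode(pageBytes, StandardCharsets.UTF_8);
        // Decoding with the charset the bytes were written in restores it.
        String right = decode(pageBytes, StandardCharsets.ISO_8859_1);

        System.out.println("matches when decoded as UTF-8: " + wrong.equals(original));
        System.out.println("matches when decoded as ISO-8859-1: " + right.equals(original));
    }
}
```

If a mismatch like this is the cause, it would also explain why the call to doc.outputSettings().charset(...) makes no difference: that setting controls how the document is written back out as HTML, not how the incoming bytes were decoded.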
foo
Can you try with text() instead? – Zack Aug 11 '16 at 13:16