I am using jsoup 1.13.1 to fetch a page's source code. When I fetch the source with the code below, I find that it has been converted to lowercase.

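Document doc = Jsoup.connect("https://example.com").get();
// WebView.loadData expects the markup as a String plus a MIME type and an encoding
webview.loadData(doc.outerHtml(), "text/html", "UTF-8");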

I have seen several answers that recommend the XML parser, but I don't know how to use the XML parser to parse HTML from a URL. They also mention a base URL, which I don't understand. I am working on an Android app project, so any answer would be helpful. Thanks in advance.

  • What is being converted to lowercase? – Dave Newton Jun 29 '21 at 17:44
  • Tags and attributes are being converted to lowercase, and line breaks are missing from the source code. – Encoder's YT Jun 29 '21 at 17:47
  • Tags are case-insensitive; that shouldn't matter at all. Preserving newlines (which generally doesn't matter) is covered [here](https://www.baeldung.com/jsoup-line-breaks); in a nutshell, disable pretty-printing. Preserving case depends on `ParseSettings` (or at least used to), as discussed [here](https://stackoverflow.com/q/31400712/438992). – Dave Newton Jun 29 '21 at 17:51
  • If you want to get the *raw* data, you can use a solution from [Read url to string in few lines of java code](https://stackoverflow.com/q/4328711). A `Document` is an already *parsed* version of the HTML structure, which explains why the case of tag names and the original line breaks are not preserved (parsers may also want to provide a nicely formatted tree structure instead of the often not so nice raw format). If the solution from the first link still doesn't help, try to clarify what you *really* want to achieve and how the content of the `Document` prevents it. – Pshemo Jun 29 '21 at 18:49
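
To illustrate the raw-data route suggested in the last comment, here is a minimal sketch that uses only the Java standard library rather than jsoup. The class name `RawSource` and the helpers `readAll` and `fetchRaw` are hypothetical names chosen for this example, and it assumes the page is encoded as UTF-8. Because nothing is parsed, tag case and line breaks come back exactly as the server sent them.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;

final class RawSource {
    // Reads every byte from the stream and decodes it as UTF-8 (an assumption;
    // a real page may declare a different charset). No parsing takes place.
    public static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    // Fetches the unparsed response body of a URL. Requires network access.
    public static String fetchRaw(String url) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return readAll(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // Offline demonstration: mixed-case tags and line breaks survive untouched.
        String sample = "<DIV Class=\"Box\">\n  <P>Hello</P>\n</DIV>\n";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        System.out.print(readAll(in));
    }
}
```

On Android, a call such as `RawSource.fetchRaw("https://example.com")` has to run off the main thread, since network access on the main thread throws `NetworkOnMainThreadException`.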

1 Answer

It's easy to use a parser other than the default: either the XML parser, which preserves case and disables pretty-printing (so line breaks are preserved), or the HTML parser configured similarly. Just use the Connection#parser() method:

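// Option 1: the XML parser preserves case and does not pretty-print.
Document xmlDocument = Jsoup.connect("https://example.com")
    .parser(Parser.xmlParser())
    .get();

// Option 2: the HTML parser set to preserve case, with pretty-printing turned off separately.
Document htmlDocument = Jsoup.connect("https://example.com")
    .parser(Parser.htmlParser().settings(ParseSettings.preserveCase))
    .get();
htmlDocument.outputSettings().prettyPrint(false);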
Dharman
Jonathan Hedley