
I am using HtmlUnit to read content from a web site.

Everything works perfectly to the point where I am reading the content with:

  HtmlDivision div = page.getHtmlElementById("my-id");

Even div.asText() returns the expected String object, but I want to get the original HTML inside <div>...</div> as a String object. How can I do that?

I am not willing to replace HtmlUnit with something else, as the web site expects the client to run JavaScript, and HtmlUnit seems to be capable of doing what is required.

AlexH
masa

1 Answer


If by original HTML you mean the HTML code that HtmlUnit has already parsed and formatted, then you can use div.asXml(). If you really are looking for the original HTML the server sent you, then you won't find a way to get it for that element (at least up to v2.14).
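Keep in mind that asXml() serializes the element itself, so the returned string includes the surrounding <div> tags. If you only want what is inside them, a small string helper can strip the wrapper. The following is a minimal sketch, not part of HtmlUnit: the class name `InnerMarkup` and the method `stripOuterTag` are made up for illustration, and it assumes the opening tag contains no `>` inside an attribute value.

```java
public class InnerMarkup {

    /**
     * Returns the markup between the end of the opening tag and the start of
     * the last closing tag of a serialized element.
     * Hypothetical helper: assumes no '>' appears inside the opening tag's
     * attribute values.
     */
    public static String stripOuterTag(String elementXml) {
        int start = elementXml.indexOf('>');       // end of the opening tag
        int end = elementXml.lastIndexOf("</");    // start of the last closing tag
        if (start < 0 || end <= start) {
            return "";                             // self-closing or malformed input
        }
        return elementXml.substring(start + 1, end).trim();
    }

    public static void main(String[] args) {
        // In real code this string would come from HtmlUnit:
        //   String xml = div.asXml();
        // A hard-coded sample keeps the sketch runnable on its own.
        String xml = "<div id=\"my-id\">\n  <h1>\n    Title\n  </h1>\n  <p>\n    Text\n  </p>\n</div>\n";
        System.out.println(stripOuterTag(xml));
    }
}
```

With HtmlUnit on the classpath, the call would simply be `InnerMarkup.stripOuterTag(div.asXml())`. String slicing is used here rather than re-parsing, because the output of asXml() is already well-formed.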

Now, as a workaround, you could get the whole text of the page that the server sent you with this answer: How to get the pure raw HTML of a page in HTMLUnit while ignoring JavaScript and CSS?

As a side note, you should probably think twice about why you need the HTML code. HtmlUnit lets you get the data out of the markup, so there shouldn't be any need to store the source code rather than the information contained in it. Just my 2 cents.

Community
Mosty Mostacho
  • By "original HTML" I mean the HTML that is there after the page load and the initial JavaScript run. `div.asText()` shows that the content is there, so the only problem is to get it in HTML. `div.asXml()` does not return the plain HTML (already tried that). – masa May 07 '14 at 16:42
  • With a closer look, it seems that `div.asXml()` works well with well-formed pages. It is those badly formed pages that still cause problems, but maybe a small hack will do. I know that copy-pasting this 'original' HTML makes little sense, I just want to keep the headlines, paragraphs etc. and those are in the XML returned by `asXml()`. – masa May 07 '14 at 19:09