0

I want to open a webpage (whose URL is given as a command-line argument) and then save the content of that webpage as a .txt file.

Remember, I need the .txt file and not the source of the webpage.

I tried Selenium and it works fine. But now I want something that doesn't open a real browser, since opening the browser and loading a page in it is time-consuming.

I want to do it in Java.

By content, I mean the text (without markup) that we get when we save a webpage in IE by going to "Save As" and then selecting ".txt" as the output format.

Valentin Rocher
Amit
  • What do you mean by the `content`? Do you want to strip out the HTML tags or just save the HTML file as a .txt file? – Earlz Jan 14 '10 at 15:11

2 Answers

3

If I understand your question correctly, you want to render the page and copy the rendered text without opening a browser.

For this, you'll need a headless browser. HtmlUnit would be a good choice.

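To get the text content, you could do something like this (not tested):

// A text/html response comes back as an HtmlPage; TextPage is only used for text/plain.
WebClient c = new WebClient(BrowserVersion.INTERNET_EXPLORER_6);
HtmlPage page = c.getPage("yoururl");
// asText() gives the rendered text without markup (newer HtmlUnit releases call it asNormalizedText()).
String content = page.asText();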

(see Javadoc)
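
Once the text is in a String (such as the `content` variable above), writing it to a .txt file only needs plain java.nio. A minimal sketch, not tied to HtmlUnit; the class name `TextSaver`, the method `save` and the file name `page.txt` are made up for illustration, and UTF-8 output is an assumption:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TextSaver {
    // Writes the given text to the target file as UTF-8 (assumed encoding),
    // creating the file or replacing its previous contents.
    public static void save(String content, Path target) throws IOException {
        Files.write(target, content.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Placeholder text; in practice this would be the text extracted from the page.
        String content = "Example page text";
        save(content, Paths.get("page.txt"));
    }
}
```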

Valentin Rocher
  • Yes, you have understood my question correctly. I have opened the webpage in the headless browser provided by HtmlUnit, but now I don't know how to save the HtmlPage to produce the desired file. – Amit Jan 14 '10 at 15:09
  • Yes, I have seen it and am trying it. It is throwing some exceptions and I am trying to find the cause... Thanks for that. – Amit Jan 14 '10 at 15:37
-1

Hmm, I'd even code that from scratch; it doesn't seem like a complex task and might not even be worth adding a dependency on another library to your project:

  • Open a URLConnection to that URL.
  • Get a stream from the connection and apply a regex to strip all the HTML out of the data. If the page is not expected to be too large for your memory requirements :) read it into a String and then apply the regex. Alternatively, give a shot to what's described here (I have no experience with the approach described there, though).
  • Save the output to a .txt file.
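
The steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the page is UTF-8, it fits in memory, and the class name `PageToText`, the helper names and the output file `page.txt` are invented here. The regexes are deliberately crude: they do not decode entities such as `&amp;`, and unlike a headless browser nothing here runs JavaScript.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PageToText {
    // Crude tag stripper: drops script/style blocks, then every remaining tag,
    // then collapses runs of whitespace into single spaces.
    public static String stripTags(String html) {
        return html
                .replaceAll("(?is)<(script|style)\\b.*?</\\1\\s*>", " ")
                .replaceAll("(?s)<[^>]*>", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }

    // Reads the whole response body into memory and decodes it as UTF-8 (assumed).
    public static String fetch(String url) throws IOException {
        URLConnection conn = new URL(url).openConnection();
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the URL given on the command line.
        String text = stripTags(fetch(args[0]));
        Files.write(Paths.get("page.txt"), text.getBytes(StandardCharsets.UTF_8));
    }
}
```

Run it as `java PageToText <url>`; the stripped text ends up in `page.txt` in the working directory.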
Community
david a.