The best tool I have come across for converting HTML to Lynx-style text is Jericho's Renderer.
It's easy to use:
    import java.net.URL;
    import net.htmlparser.jericho.Source;

    // Parse from a URL, or pass raw markup instead: new Source("<html>...</html>")
    Source source = new Source(new URL(sourceUrlString));
    String renderedText = source.getRenderer().toString();
    System.out.println("\nSimple rendering of the HTML document:\n");
    System.out.println(renderedText);
It also handles the badly formatted HTML found in the wild very well.
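If you want a bit more control over the output, the Renderer has a few setters you can call before rendering. Here is a minimal sketch, assuming the Jericho HTML Parser jar (package net.htmlparser.jericho) is on the classpath. The class name RenderExample, the sample markup, and the option values are just my own illustration, not something from the Jericho samples.

```java
import net.htmlparser.jericho.Renderer;
import net.htmlparser.jericho.Source;

class RenderExample {
    // Renders an HTML string to plain text with Jericho's Renderer.
    // The option values below are illustrative assumptions, not recommendations.
    static String render(String html) {
        Source source = new Source(html);         // parse from an in-memory string
        Renderer renderer = source.getRenderer();
        renderer.setMaxLineLength(40);            // wrap text at 40 columns
        renderer.setIncludeHyperlinkURLs(false);  // keep link text, drop the URL
        return renderer.toString();
    }

    public static void main(String[] args) {
        // Deliberately sloppy markup: unclosed <p> and <li> elements.
        String html = "<html><body><h1>Title</h1>"
                + "<p>Some text with a <a href=\"http://example.com/\">link</a>"
                + "<ul><li>one<li>two</ul></body></html>";
        System.out.println(render(html));
    }
}
```

Leaving the hyperlink URLs in (the default) is probably closer to what Lynx shows, so treat the settings above as a starting point to experiment with.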
Here are the first few lines of this page, rendered with Jericho:
Stack Exchange log in | careers | chat
| meta | about | faq
Stack Overflow
* Questions
* Tags
* Users
* Badges
* Unanswered
* Ask Question
Java HTML normalizer?
**
IS there a library which can transform
any given HTML page with JS, CSS all
over it, into a minimalistic uniform
format?
For instance, if we render
stackoverflow homepage, I want it to
be shown in a minimal format. I want
all other sites to be rendered down.
Sort of like Lynx web browser but with
minimal graphics.
java lynx link|edit|flag asked 2 days
ago Kim Jong Woo 593112 89% accept
rate Do you want to transform your
HTML code to simpler HTML code, or do
your want to show this "minimalistic
uniform format" to your user? Or do
you want to create a image? – Paŭlo
Ebermann yesterday simpler html code
without sacrificing the relative
positioning of the elements. – Kim
Jong Woo 16 hours ago
2 Answers
To answer your firtst question: No. I
don'nt think there is a library for
that purpose. (At least this is what
my "googeling" resulted in).
And i think the reason for this is,
that what you want is a very special
need.
So as a solution for your problem you
can parse the html and display it the
way you want to in a JEditorpane or
whatever you are using for display.
I can only suggest a way i would do it
(this is because i am familiar with
xml and everything around it).
*
Use a library to ensure that your html conforms to xhtml:
http://htmlcleaner.sourceforge.net/release.php
*
then either parse the xml with DOM or SAX parsers and display it the
way you want.
or
* use xslt to transform the document into some other html document
which results in a view that fits your
needs.
or
* use one of the available html parser librarys. (The most of which i
found where kind of outdated (2006))
but they could be an option for you.
This is just one suggestion how you
could do it. I'm sure there are
thousands of other ways which will do
the same thing.