Parsing HTML and get all the nodes

Question

I need to parse an HTML file in Java. Unlike XML, there are no repetitive tags. So I need code that can parse the HTML file and reach all of its nodes, including nested tags. The HTML is not fixed; in other words, given any HTML, I need to reach every tag in it.

This question is related: http://stackoverflow.com/questions/9664778/parsing-html-in-java – om-nom-nom Mar 13 '12 at 09:24

score 1 · Answer 1 · answered Mar 13 '12 at 06:17


Try this HTML Parser: http://htmlparser.sourceforge.net/samples.html


Abhishek Choudhary


Hmm, I was not able to understand. Could you please elaborate? – Saicharan S M Mar 13 '12 at 06:30
This is an HTML parser you can use in Java. It returns the HTML content in an XML-like form: tags become nodes, alongside the text content. Check the examples. – Abhishek Choudhary Mar 13 '12 at 06:40
The examples are all command line. I couldn't find a Java example. Sorry for bugging you; I'm an amateur. – Saicharan S M Mar 13 '12 at 06:50
The examples may all be command line, but they also include links to the Javadocs of the relevant API classes involved. For example, in the entry for `Lexer`, it says, "Print the low level nodes of a web page" which sounds just like what you're looking for. It links to [here](http://htmlparser.sourceforge.net/javadoc/org/htmlparser/lexer/Lexer.html). The source code to the whole thing is also available for study. Now—what have you tried? – Alistair A. Israel Mar 13 '12 at 06:54
I have tried Jericho, JTidy and jsoup, but I can't figure it out. I can't find any concrete example code anywhere on the net that parses an HTML document and reaches all of its tags. – Saicharan S M Mar 13 '12 at 06:59
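
For a concrete starting point, here is a minimal sketch of visiting every tag, nested ones included. It assumes the HTML parser that ships with the JDK (`javax.swing.text.html.parser.ParserDelegator`) rather than the HTML Parser library linked in this answer, so nothing extra has to be installed. The class name `TagLister` and the method `listTags` are made up for illustration. Note that this parser targets older HTML and may also report tags it infers on its own (for example a missing `head`).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of any library mentioned in the thread.
public class TagLister {

    /** Returns the name of every tag the parser reports, in document order. */
    public static List<String> listTags(Reader html) throws IOException {
        final List<String> tags = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Called for every opening tag, at any nesting depth.
                tags.add(tag.toString());
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Called for tags without a body, such as <br> or <img>.
                tags.add(tag.toString());
            }
        };
        // The last argument tells the parser to ignore any charset declaration.
        new ParserDelegator().parse(html, callback, true);
        return tags;
    }

    public static void main(String[] args) throws IOException {
        String sample = "<html><body><div><p>Hello <b>world</b></p><br></div></body></html>";
        for (String tag : listTags(new StringReader(sample))) {
            System.out.println(tag);
        }
    }
}
```

To read a file instead of a string, pass a `java.io.FileReader` (or any other `Reader`) to `listTags`. The same callback class also has `handleText` and `handleEndTag` methods if you need the text content or the closing tags.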

score 0 · Answer 2 · edited Mar 13 '13 at 15:02


I think you need this...

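var els = document.getElementsByTagName("*"), names = [];
// els is a live collection, so collect the names before writing anything
for (var i = 0; i < els.length; i++) names.push(els[i].nodeName);
document.write(names.join("<br />"));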

edited Mar 13 '13 at 15:02

CoffeeRain


answered Mar 13 '12 at 06:16

Vinoth Kumar


No, it doesn't parse the innermost nodes. Do you have any other ideas? – Saicharan S M Mar 13 '12 at 06:25
Yes, there are similar methods in Java too. I tried them, but they don't work. – Saicharan S M Mar 13 '12 at 06:25
