Regex for an html text code in Java

Question

I have an HTML text file that contains headings, and I would like to extract only the text inside them.

Example:

<h1 class="title"><a href="dtb.htm#rgn_txt_0001_0001">Fire Safety</a></h1>
<h1><a href="dtb.htm#rgn_txt_0002_0001">About this book</a></h1>
<h1><a href="dtb.htm#rgn_par_0002_0008">1</a></h1>
<h1><a href="dtb.htm#rgn_txt_0003_0001">Contents of this book</a></h1>

I would like to extract only the following text from the HTML code:

Fire Safety, About this book, 1, Contents of this book

I tried a lot of things, like:

Pattern pattern = Pattern.compile("<a[^>]href\\s=\\s*\"\\s*([^\"]*)");
Matcher matcher = pattern.matcher(input);

where input is the html data.

I either get no results on the console or, sometimes, only the href value :(

How do I fix this?

Let me know! Thanks!

Please Please Please! Don't parse HTML with Regex. Try http://jsoup.org/ — Rohit Jain, Dec 18 '12 at 07:05
You cannot parse HTML with Regex, lest this happens again: http://stackoverflow.com/a/1732454/504685 — Charlie, Dec 18 '12 at 07:06
@RohitJain It's not that you shouldn't parse HTML with RegEx, it's that you can't. — Cubic, Dec 18 '12 at 07:12
OK. Why can't I use regex? What is the issue behind it? Moreover, it is not an HTML file; it is just HTML source code stored in a text file. — TheDevMan, Dec 18 '12 at 08:54
@user1443051 HTML is a non-regular context free language. You can only describe regular languages with regular expressions though. See any introductory article on formal languages for details. — Cubic, Dec 18 '12 at 12:25
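
For the narrow case in the question, where every heading is an <h1> that directly wraps a single <a>, the input is regular enough for a pattern to work, and it helps to see why the original attempt fails. In that pattern, [^>] consumes exactly one character and \s= demands a whitespace character before the equals sign, so it never matches <a href="...">; when a variant does match, the capture group wraps the href value rather than the link text. A Matcher also returns nothing until find() is called. The following is only a sketch under those assumptions; the class name, method name, and pattern are illustrative, and it will break on nested tags, comments, or irregular markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeadingRegexSketch {
    // Assumed shape: an <h1> that directly wraps one <a>. Group 1 is the link text.
    private static final Pattern H1_LINK = Pattern.compile(
            "<h1[^>]*>\\s*<a\\b[^>]*>(.*?)</a>\\s*</h1>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static List<String> extractHeadingTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher matcher = H1_LINK.matcher(html);
        // find() must be called repeatedly; creating the Matcher alone matches nothing.
        while (matcher.find()) {
            texts.add(matcher.group(1).trim());
        }
        return texts;
    }

    public static void main(String[] args) {
        String input =
                "<h1 class=\"title\"><a href=\"dtb.htm#rgn_txt_0001_0001\">Fire Safety</a></h1>\n"
              + "<h1><a href=\"dtb.htm#rgn_txt_0002_0001\">About this book</a></h1>\n"
              + "<h1><a href=\"dtb.htm#rgn_par_0002_0008\">1</a></h1>\n"
              + "<h1><a href=\"dtb.htm#rgn_txt_0003_0001\">Contents of this book</a></h1>\n";
        // Prints: Fire Safety, About this book, 1, Contents of this book
        System.out.println(String.join(", ", extractHeadingTexts(input)));
    }
}
```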

score 3 · Answer 1 · answered Dec 18 '12 at 07:08


I would strongly recommend using an HTML parser, such as TagSoup, Jericho, NekoHTML, or HTML Parser.
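
As a rough illustration of the parser approach, here is a sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html) instead of one of the libraries named above, purely so that it runs without extra dependencies. The class and method names are hypothetical. It assumes the link text of interest is whatever appears inside an <a> nested in an <h1>; the bundled parser targets HTML 3.2, so a dedicated library is likely the better choice for real-world pages.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class HeadingParserSketch {
    static List<String> extractHeadingTexts(String html) {
        final List<String> texts = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean inH1;
            private boolean inLink;
            private final StringBuilder current = new StringBuilder();

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.H1) {
                    inH1 = true;
                } else if (tag == HTML.Tag.A && inH1) {
                    inLink = true;
                    current.setLength(0); // start collecting a new link text
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (inLink) {
                    current.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.A && inLink) {
                    inLink = false;
                    texts.add(current.toString().trim());
                } else if (tag == HTML.Tag.H1) {
                    inH1 = false;
                }
            }
        };
        try {
            // true = ignore any charset declared in the document; input is already a String
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return texts;
    }

    public static void main(String[] args) {
        String input =
                "<h1 class=\"title\"><a href=\"dtb.htm#rgn_txt_0001_0001\">Fire Safety</a></h1>\n"
              + "<h1><a href=\"dtb.htm#rgn_txt_0002_0001\">About this book</a></h1>\n";
        System.out.println(String.join(", ", extractHeadingTexts(input)));
    }
}
```

Unlike a pattern, the callback tracks which element it is currently inside, so attribute order, extra whitespace, and line breaks within a tag do not need special handling.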


NullPoiиteя

I don't have any information on the parsers that are available. – TheDevMan Dec 18 '12 at 09:05
@user1443051 ... and that's why NullPointer gave you links to 4 of them. – Cubic Dec 18 '12 at 12:26
