Parsing Random Web Pages

Question

I need to parse a bunch of random pages and add them to a DB. I am thinking of using regular expressions, but I was wondering if there are any 'special' techniques (other than looking for content between known text/tags). The content usually (though not always) looks like:

Some Title
Text related to Title

I guess I don't need to extract the complete text, just some way to find where the title and paragraph are and extract the content from there. The content itself may contain images/links that I would like to retain.

Thanks!

Quick, somebody link to that "don't parse HTML with regexs" rant! — Dean Harding, Sep 21 '10 at 10:08
Since HTML is almost XML, you could use any old XML parser to find the `/html/head/title` etc. — bzlm, Sep 21 '10 at 10:10
Since HTML can be ill-formed and still be tolerated by a browser, you'll be surprised at how bad it is. An XML parser will often be baffled by broken XML and a regular expression can never work on practical HTML parsing. — S.Lott, Sep 21 '10 at 10:11

score 1 · Accepted Answer · edited May 23 '17 at 12:01

Please see this answer: RegEx match open tags except XHTML self-contained tags

answered Sep 21 '10 at 10:11

Daniel Cassidy

score 0 · Answer 2 · answered Sep 21 '10 at 10:10

Use Python. http://www.python.org/
Use Beautiful Soup. http://www.crummy.com/software/BeautifulSoup/
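
For illustration, here is a minimal sketch of what that suggestion could look like for the title/paragraph layout in the question. It assumes the third-party `beautifulsoup4` package is installed; the function name `title_and_paragraphs` and the `PAGE` sample are made up for this example and are not from the original answer.

```python
from bs4 import BeautifulSoup  # third-party package: beautifulsoup4


def title_and_paragraphs(html):
    """Return (title, [inner HTML of each <p>]) for one page."""
    # "html.parser" selects the builder backed by the standard library;
    # lxml or html5lib can be used instead if they are installed.
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    # decode_contents() returns the markup inside the tag, so nested
    # links and images are retained rather than flattened to text.
    paragraphs = [p.decode_contents().strip() for p in soup.find_all("p")]
    return title, paragraphs


# Hypothetical sample page; note the missing </html>.
PAGE = """<html><head><title>Some Title</title></head>
<body><p>Text with a <a href="/x">link</a> and <img src="a.png"></p>
<p>Second paragraph</p>
</body>"""

if __name__ == "__main__":
    page_title, page_paragraphs = title_and_paragraphs(PAGE)
    print(page_title)
    for paragraph in page_paragraphs:
        print(paragraph)
```

The returned title and paragraph strings could then be written to the DB with whatever database layer is already in use.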

answered Sep 21 '10 at 10:10

S.Lott

Thanks! I am planning to use .NET. – vent Sep 21 '10 at 20:50
@Venkateshwar: Please **update** your question with all the facts. Python and Beautiful Soup work perfectly in .Net – S.Lott Sep 21 '10 at 22:38

score 0 · Answer 3 · answered Sep 21 '10 at 10:18

You need to use a proper HTML parser, and extract the elements you’re interested in via the parser’s API (or via the DOM).

Since I don’t know what language you’re programming in, it’s difficult to recommend a parser, but some well-known ones are Jericho for Java and Beautiful Soup for Python.
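
To make "extract the elements via the parser's API" concrete, here is a rough sketch using only Python's standard-library `html.parser`, which tolerates ill-formed markup. The class name `TitleAndParagraphs`, the `INLINE` whitelist, the `extract` helper, and the `SAMPLE` page are all assumptions made for this example; a real page mix would likely need a richer set of rules.

```python
from html.parser import HTMLParser

# Inline tags kept verbatim inside a paragraph (an illustrative choice).
INLINE = {"a", "img", "em", "strong", "b", "i", "span", "br"}


class TitleAndParagraphs(HTMLParser):
    """Collect the <title> text and the inner markup of each <p>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._current = None  # pieces of the <p> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._flush()  # a new <p> implicitly ends an unclosed one
            self._current = []
        elif tag in INLINE:
            if self._current is not None:
                self._current.append(self.get_starttag_text())
        else:
            self._flush()  # any other tag ends the paragraph

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <img ... />: keep it once, no end tag.
        if tag in INLINE and self._current is not None:
            self._current.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in INLINE:
            if self._current is not None:
                self._current.append(f"</{tag}>")
        else:
            self._flush()

    def handle_data(self, data):
        # Character references arrive already decoded (convert_charrefs=True).
        if self._in_title:
            self.title += data
        elif self._current is not None:
            self._current.append(data)

    def _flush(self):
        if self._current is not None:
            text = "".join(self._current).strip()
            if text:
                self.paragraphs.append(text)
            self._current = None

    def close(self):
        super().close()
        self._flush()


def extract(html):
    """Return (title, list of paragraph markup strings)."""
    parser = TitleAndParagraphs()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.paragraphs


# Hypothetical ill-formed page: unclosed <p> tags, no </p> anywhere.
SAMPLE = """<html><head><title>Some Title</title></head>
<body><p>Text with a <a href="/x">link</a> and <img src="a.png">
<p>Second paragraph, never closed
</body></html>"""

if __name__ == "__main__":
    print(extract(SAMPLE))
```

The point of the sketch is that the parser reports tags and text as events, so the code never has to guess at tag boundaries the way a regular expression would.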
