Quickest/easiest way to parse HTML of a website?

Question

I need to parse the contents of this website and store it in a MySQL database. I'm making a competitor site to that one as the creator never completely finished his, but he has newer game data than I do and won't release it, so I need to collect it manually. Here is an example of the specific type of page I need to parse.

I've done HTML parsing before with PHP and regex, but it was painfully tedious and I would much rather not go through the hassle of that again. I've been procrastinating on finishing my database for months because of this issue. Is there a faster and/or easier way of going about this? Most C-style languages are fine for me (C, C++, Perl, PHP, Python, etc., are all fine, but not C#, Java, or Objective-C).

P.S.: I don't care how dirty the script/program turns out or anything like that, so long as it gets the job done.

His data was leaked from the official game servers, so technically I'm merely "stealing" what he already stole. Besides, the data is public anyway. — delaccount992, Sep 07 '11 at 10:57
Using regexp in PHP to parse HTML file is big mistake -- http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662 — ajreal, Sep 07 '11 at 11:07

score 1 · Answer 1 · answered Jun 14 '12 at 18:13


You can use PHP with the Simple HTML DOM parser (simpleHtmlDom) to parse HTML; it is very easy to use.

http://simplehtmldom.sourceforge.net/manual.htm


Shakil Ahmed


mcsdwarken · Answer 2 · 2013-12-18T13:59:31.327

1

I used Html Agility Pack (http://htmlagilitypack.codeplex.com/) and Fizzler (http://code.google.com/p/fizzler/) to parse HTML and grab the information I needed. They work very well.

edited Dec 18 '13 at 13:59

answered Sep 07 '11 at 10:57


Nice ... except all designed for .Net – ajreal Sep 07 '11 at 10:59
Yeah.. that's a bit of an issue. I have no experience with .NET. If it's worth the effort though, I wouldn't mind giving it a go for this project. – delaccount992 Sep 07 '11 at 11:01

Michał Šrajer · Answer 3 · 2011-09-07T11:05:56.057

1

I did that a few months ago, and after some investigation I decided to go with the lxml Python library. See the parsing tutorial here. And yes, it's not only for XML parsing; it handles HTML as well.

I like it because it's powerful and easy to use.
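As a rough illustration of the lxml approach, here is a minimal sketch that pulls the cell text out of every table row with XPath. It assumes the data sits in plain `<table>` rows; the function name `extract_rows` is made up for this example, and the real pages may need a more specific XPath expression.

```python
from lxml import html


def extract_rows(page_source):
    """Return the text of each <td> cell, one list per table row."""
    doc = html.fromstring(page_source)
    rows = []
    for tr in doc.xpath("//table//tr"):
        # text_content() flattens nested markup such as <b> or <a>.
        cells = [td.text_content().strip() for td in tr.xpath("./td")]
        if cells:  # rows with only <th> header cells are skipped
            rows.append(cells)
    return rows
```

Passing it the source of a downloaded page gives a list of rows that can then be mapped onto database columns.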

edited Sep 07 '11 at 11:05

answered Sep 07 '11 at 11:00


score 1 · Accepted Answer · answered Sep 07 '11 at 11:01

Any of the languages you mentioned can do that, as long as you use the correct third-party libraries to help you.

You'll need something that crawls the site. This could be a completely separate program that just downloads the .html files to your computer, where you then run the parser on them. Such robots already exist: wget, for example, can download a site recursively (curl fetches individual URLs but does not follow links on its own).
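If you would rather script the download step yourself, the idea can be sketched in Python with only the standard library. This is a minimal breadth-first crawler, not a polished one: the names `LinkCollector`, `fetch` and `crawl` are made up for this example, it stays on the starting host, and it ignores robots.txt, rate limiting and error handling, all of which a real crawl should consider.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch(url):
    """Download one page and return its text (performs real network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl restricted to the host of start_url.

    Returns a dict mapping each visited URL to its HTML text.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in pages:
            continue
        page = fetch_page(url)
        pages[url] = page
        collector = LinkCollector()
        collector.feed(page)
        for href in collector.links:
            # Resolve relative links and drop any #fragment part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == host and absolute not in pages:
                queue.append(absolute)
    return pages
```

The `fetch_page` parameter exists so the crawl logic can be tried against saved pages or a dictionary of test HTML before pointing it at the live site.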

You'll need a parser for the site. Don't use regexps to parse HTML; use an HTML or XML parser (like Perl's HTML::Parser). Then you'll have to convert the resulting data structure into usable data (for example, the first table>tr>td is the monster name, the second td is the race, etc.).
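The td-to-field mapping described above might look like the following sketch, which uses Python's built-in `html.parser` rather than Perl's HTML::Parser. The column order (name first, race second) follows the example in the text; the class name, function name and sample rows are invented for illustration and are not taken from the real site.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text pieces of the cell being read, or None outside <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows without <td> cells, e.g. header rows
                self.rows.append(self._row)
            self._row = None


def parse_monsters(page_source):
    """Map the first cell of each row to 'name' and the second to 'race'."""
    parser = TableRowParser()
    parser.feed(page_source)
    parser.close()
    return [{"name": row[0], "race": row[1]} for row in parser.rows if len(row) >= 2]


# Made-up sample markup, standing in for a downloaded page.
SAMPLE = """
<table>
  <tr><th>Name</th><th>Race</th></tr>
  <tr><td>Example Monster A</td><td>Race X</td></tr>
  <tr><td><b>Example Monster B</b></td><td>Race Y</td></tr>
</table>
"""

monsters = parse_monsters(SAMPLE)
```

Keeping the mapping in one small function makes it easy to adjust when a page turns out to use a different column order.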

Finally, you'll need to store the results in your database in a way that lets you retrieve them later to serve on your site.
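A minimal sketch of the storage step follows. It uses Python's built-in `sqlite3` module as a stand-in so that it runs without a database server; the question targets MySQL, where the same pattern applies through a MySQL driver, though the placeholder syntax (typically `%s`) and the upsert statement differ. The table layout and function names are assumptions made for this example.

```python
import sqlite3


def store_monsters(conn, monsters):
    """Insert or update one row per monster dict with 'name' and 'race' keys."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS monsters ("
        " name TEXT PRIMARY KEY,"
        " race TEXT NOT NULL)"
    )
    # Parameterized queries keep scraped text from being interpreted as SQL.
    conn.executemany(
        "INSERT OR REPLACE INTO monsters (name, race) VALUES (:name, :race)",
        monsters,
    )
    conn.commit()


def load_monsters(conn):
    """Read the stored rows back, sorted by name."""
    rows = conn.execute("SELECT name, race FROM monsters ORDER BY name")
    return [{"name": name, "race": race} for name, race in rows]
```

Using the name as the primary key means re-running the scraper updates existing rows instead of duplicating them, which helps when the source site changes its data.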

Writing the code won't be the hardest part; working out the mapping ("which item on the page means what, and where and how should it be stored") will be.
