I am trying to develop a program that will parse a website for data and store that data in variables that I can then use in functions inside the program.

Specifically, I'm trying to parse this page (click the Debuffs tab):

http://worldoflogs.com/reports/rt-1smdoscr7neq0k6b/spell/94075/

The source is pretty simple and looks like this.

    <td><a href='/reports/rt-1smdoscr7neq0k6b/details/62/' class='actor'><span class='Warrior'>Zonnza</span></a></td>
    <td>100</td>
    </tr>
    <tr>
    <td><a href='/reports/rt-1smdoscr7neq0k6b/details/3/' class='actor'><span class='DeathKnight'>Fillzholez</span></a></td>
    <td>89</td>
    </tr>

I only want the names and the numbers, i.e. the text between the <td></td> tags and between the <span class=''></span> tags. Is there any way to do what I'm looking for?

Any help would be greatly appreciated.

ildjarn
Cistoran

3 Answers

I'd look into Tag Soup. It's a parser for HTML that can cope with all the horrible HTML that's out there. There's a C++ port of it available too (I haven't used it, so I can't comment on how stable it is).

Jeff Foster

There are no C++ libraries for what you're trying to do (unless you're going to link with half of Mozilla or WebKit), but you can consider using Java with HTMLUnit.

And for those suggesting regular expressions, an obligatory reference.

SK-logic

There's no need to use C++ when C-style sscanf will do, or even Perl or any other language with regular expression support.

Andy Finkenstadt
  • I'm looking to specifically use C++ or possibly C#, as those are the only actual programming languages that I know. – Cistoran Apr 08 '11 at 15:17
  • @Cistoran, that is not a valid reason for choosing a language. It is better to select a language based on library availability. But if you're OK with C#, you can use HTMLUnit with IKVM; at least it works for me. – SK-logic Apr 08 '11 at 15:24
  • @SK-Logic I'm not saying it is, but those are the only languages I've had a chance/opportunity to learn thus far. – Cistoran Apr 08 '11 at 15:26
  • @Cistoran, Java is almost a subset of C#, so you won't have any trouble using it. – SK-logic Apr 08 '11 at 15:34
  • @SK-logic I'm already using the Tag-Soup port for C++ but thanks. – Cistoran Apr 08 '11 at 15:39
  • @Cistoran, HTMLUnit is much more robust: it allows you to extract data from AJAX pages too. The only way to extract data properly from modern web pages is to simulate a full-blown browser; no simple parsing will be generic enough. – SK-logic Apr 08 '11 at 15:45
  • There's no need to use C, or C++, or even `perl` input when there are existing libraries and utilities out there. – Thomas Matthews Apr 08 '11 at 20:35