
I have an HTML file with very badly formatted code that I get from a website, and I want to extract a few small pieces of information from it.

I am only interested in lines that start like this:

</form></td><td><a href="http://www.mysite.com/users/user897" class="username">   <b>user897</b></a></td></tr><tr><td>HouseA</td><td>2</td><td class="entriesTableRow-gamename">HouseA Type12 <span class="entriesTableRow-moredetails"></span></td><td>1 of 2</td><td>user123</td><td>10</td><td>

and I want to extract 4 fields:

  A:HouseA
  B:HouseA Type12
  C:user123
  D:10

I know I've seen people recommend HTML Agility Pack and libxml2, but I really don't think I need all that. My app is in C/C++.

I am already using getline to read the file line by line; I am just not sure what the best way to proceed is. Thanks!

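    #include <cstdio>
    #include <cstring>
    #include <fstream>
    #include <string>

    int main()
    {
        std::ifstream data("Home.html");
        std::string line;
        int linenum = 0;  // line counter (was undeclared in the original snippet)
        const char prefix[] = "</form></td><td>";

        while (std::getline(data, line))
        {
            linenum++;
            // Only lines that begin with the marker are of interest.
            if (std::strncmp(line.c_str(), prefix, std::strlen(prefix)) == 0)
            {
                std::printf("found a wanted line in line:%d\n", linenum);
            }
        }
    }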
emge

1 Answer


In the general case, an XML/HTML parser is likely the best way here, as it will be robust against differing input. (Whatever you do, don't use regexps!)

Update

However, if you're targeting specific input, as it seems that you're doing, you can use sscanf (as you suggest), cin.read() or a regexp to scan manually.
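As a rough illustration of manual scanning, here is a minimal sketch that uses std::string::find instead of sscanf. It assumes every wanted line has exactly the layout of the sample line in the question; the names nextField, parseEntry and Entry are made up for this example and are not part of any library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: finds the next `open` marker at or after `pos`,
// copies the text up to the following `close` marker into `out`,
// and advances `pos` past `close`. Returns false if a marker is missing.
bool nextField(const std::string& line, std::size_t& pos,
               const std::string& open, const std::string& close,
               std::string& out)
{
    std::size_t start = line.find(open, pos);
    if (start == std::string::npos) return false;
    start += open.size();
    std::size_t end = line.find(close, start);
    if (end == std::string::npos) return false;
    out = line.substr(start, end - start);
    pos = end + close.size();
    return true;
}

// The four fields from the question (A, B, C, D).
struct Entry { std::string a, b, c, d; };

// Assumes the cell order of the sample line:
// <tr><td>A</td> ... gamename">B <span ...></span></td><td>1 of 2</td><td>C</td><td>D</td>
bool parseEntry(const std::string& line, Entry& e)
{
    std::size_t pos = 0;
    std::string skipped;  // holds the "1 of 2" cell, which is not wanted

    if (!nextField(line, pos, "<tr><td>", "</td>", e.a)) return false;
    if (!nextField(line, pos, "entriesTableRow-gamename\">", "<span", e.b)) return false;
    e.b.erase(e.b.find_last_not_of(" \t") + 1);  // trim trailing whitespace
    if (!nextField(line, pos, "</span></td><td>", "</td>", skipped)) return false;
    if (!nextField(line, pos, "<td>", "</td>", e.c)) return false;
    if (!nextField(line, pos, "<td>", "</td>", e.d)) return false;
    return true;
}
```

Inside the getline loop from the question, you would call parseEntry(line, entry) for each line that starts with the marker and skip any line for which it returns false.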

Just beware that this code can break whenever the HTML changes (even if only the whitespace changes).

Therefore, my/our recommendation is to use a proper tool for the job. XML/HTML is not raw text, and should not be treated as such.

How about writing a Python script instead? :)

Macke
  • what small xml/html parser can I use? – emge Feb 17 '11 at 22:59
  • I've only used 'big' ones, like Xerces, the one in Qt or one built with boost::spirit (which requires boost, naturally). None of them would qualify as small, but each would work. If you dig around maybe you'll find something useful. Maybe MFC has something you can use too? – Macke Feb 17 '11 at 23:08
  • is it possible to use something like sscanf or something similar? – emge Feb 17 '11 at 23:11
  • Well, if you are targeting some very specific lines and input, sscanf or cin >> char_value would work. – Macke Feb 17 '11 at 23:12
  • can you give me an example of how I would use sscanf or cin in this case? – emge Feb 17 '11 at 23:32