4

This is homework, so I hope you guys don't give me the direct answers/code, but guide me to the solution instead.

My problem is that I have this XXX.html file, which contains thousands of lines of code, but I only need to extract this portion:

<html>
...
<table>
    <thead>
        <tr>
            <th class="xxx">xxx</th>
            <th>xxx</th>                       <th>xxx</th>         </tr>
    </thead>
    <tbody>
        <tr class=xxx>
        <td class="xxx"><a href="xxx" >ZZZ ZZ ZZZ</a></td>
<td>ZZZZ</td>        <td class="xxx">ZZZZ</td>    </tr>    <tr class=xxx>
<td class="xxx"><a href="xxx" >ZZZ ZZ ZZZ</a></td>
<td>ZZZZ</td>        <td class="xxx">ZZZZ</td>    </tr>    <tr class=xxx>
<td class="xxxx"><a href="xxxx" >ZZZ ZZ ZZZ</a></td>
<td>ZZZZ</td>        <td class="xxxx">zzzz</td>    </tr>    <tr class=xxx>
<td class="xxx"><a href="xxxx" >ZZZ ZZ ZZZ</a></td>
    ... and so on

This is my code so far:

// after opening the file
while(!fileOpened.eof()){
        getline(fileOpened, reader);
        if(reader.find("ZZZ")){
            cout << reader << endl;
        }
    }

`reader` is a string variable that holds each line of the HTML file. The ZZZZ values are live data, so they will change; what method should I use instead of `find`? (I am really sorry for not mentioning this part earlier.)

But instead of displaying the values that I want, it displays other portions of the HTML file. Why? Is my method wrong? If it is, how do I extract the ZZZZ values?

Cœur
  • What is `reader` and what is `readLine`. Shouldn't it be one and the same variable? – Draco Ater Oct 24 '10 at 10:06
  • Hi Draco, I edited my question, it should be the same. Its the same variable – cpp_learner Oct 24 '10 at 10:13
  • Without even looking at the question, you get a `+1` from me for "I hope you guys dont give me the direct answers/code". And I'd give you a `+10` if I could. – sbi Oct 24 '10 at 11:51
  • Line breaks might be freely put into HTML. How do you know what you're looking for really is all in one line? (You might have to write a simple HTML parser to do what you want.) – sbi Oct 24 '10 at 11:54
  • Well.. I dont understand what you are trying to extract: all lines containing 'ZZZZ' or ONLY ZZ's i.e. combination of 2 or more Z? And +1 for mentioning that you do not need a code, but the method. Nice. – SkypeMeSM Oct 24 '10 at 12:25
  • [possible duplicate of...](https://stackoverflow.com/a/1732454/2757035) – underscore_d Apr 05 '18 at 16:10

3 Answers

3

std::string::find does not return a boolean value. It returns an index into the string where the substring match occurs if it is successful, else it returns std::string::npos.

So you would want to say:

    if (reader.find("ZZZ") != std::string::npos){
        cout << reader << endl;
    }
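
To see why the original condition printed the wrong lines, here is a small sketch comparing the two tests. The helper names `contains` and `find_as_condition` are made up for illustration. Since `npos` is a non-zero value, using the result of `find` directly as a condition is true for a line with no match and false only when the match starts at index 0.

```cpp
#include <string>

// Hypothetical helper: true when `needle` occurs anywhere in `line`.
bool contains(const std::string& line, const std::string& needle) {
    return line.find(needle) != std::string::npos;
}

// Hypothetical helper mirroring `if (reader.find("ZZZ"))`:
// the returned index is converted to bool, so any non-zero index,
// including npos ("not found"), counts as true.
bool find_as_condition(const std::string& line, const std::string& needle) {
    return static_cast<bool>(line.find(needle));
}
```

With these, `find_as_condition("<html>", "ZZZ")` is true although the line contains no match, while `contains("<html>", "ZZZ")` is false.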
Charles Salvia
0

In general, plain string matching won't work reliably for extracting values from an HTML file. A proper HTML parser is required; such parsers are available as C++ libraries.

Otherwise I'd suggest using a regex library (boost::regex until C++0x comes out). You'll be able to write better expressions to capture the part of the file you are interested in.

Reading line by line probably won't work, since an HTML file could be one long line. Outputting each matching line would then simply emit the entire file. So try regexes instead: look for small sections of the markup and output only those. The regex library will have a "match all" command (I forget the exact name).
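
As a rough sketch of that idea, the following uses `std::regex` from C++11 in place of `boost::regex`, with `std::sregex_iterator` playing the role of "match all". The function name `extract_cells` and the pattern are assumptions tailored to the table shown in the question (a `<td>` holding plain text, optionally wrapped in one `<a>`); this is not a general HTML solution.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the text of every <td> cell in `html`.
// Assumes each cell holds plain text, optionally wrapped in a single <a> tag.
std::vector<std::string> extract_cells(const std::string& html) {
    static const std::regex cell(
        R"re(<td[^>]*>(?:<a[^>]*>)?([^<]*)(?:</a>)?</td>)re");
    std::vector<std::string> values;
    // Iterate over all non-overlapping matches in the whole text,
    // regardless of where the line breaks fall.
    for (std::sregex_iterator it(html.begin(), html.end(), cell), end;
         it != end; ++it) {
        values.push_back((*it)[1].str());  // capture group 1 is the cell text
    }
    return values;
}
```

Because the pattern describes the markup around the value rather than the value itself, it keeps working when the cell contents change.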

edA-qa mort-ora-y
  • it looks like lots of things to study if I use the boost::regex. I am just starting to learn C++, it might take some time to implement it. Is there any shorter/easier way for beginner? – cpp_learner Oct 24 '10 at 10:31
  • Regular expressions would take me weeks/months to master =( – cpp_learner Oct 24 '10 at 10:36
  • Well, the HTML parsers are harder to use than regex. But I can say that learning regex will be well worth your time. They come up again and again and again. – edA-qa mort-ora-y Oct 24 '10 at 20:59
0

The skeleton code for reading lines from a file should look like this:

if( !file.good() )
  throw "opening file failed!";

for(;;) {
  std::string line;
  std::getline(file, line);
  if( !file.good() )
    break;
  // reading succeeded, process line
}

if(!file.eof())
  // error before reaching EOF

(That funny-looking loop checks its ending condition in the middle of the loop body. C++ has no dedicated loop construct for that, so you have to use an endless loop with a break in the middle.)

However, as I said in a comment to your question, reading HTML code line by line isn't necessarily useful, because HTML doesn't depend on specific whitespace.
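
One way to stop depending on line breaks, sketched here under the assumption that the file is small enough to hold in memory, is to read the whole stream into one string and search for the markup that surrounds a value. Both `read_all` and `text_between` are hypothetical helper names.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: reads everything left in `in` into one string.
std::string read_all(std::istream& in) {
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}

// Hypothetical helper: returns the text between the first `open` and the
// next `close` after it, or an empty string if either marker is missing.
std::string text_between(const std::string& text,
                         const std::string& open,
                         const std::string& close) {
    const std::string::size_type start = text.find(open);
    if (start == std::string::npos) return "";
    const std::string::size_type from = start + open.size();
    const std::string::size_type stop = text.find(close, from);
    if (stop == std::string::npos) return "";
    return text.substr(from, stop - from);
}
```

Searching for the fixed markers instead of the value means the search does not need to change when the value does, and a cell that spans several lines is still found.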

sbi
  • why not just `while (std::getline(file, line))`? – Zereges Apr 05 '18 at 16:09
  • Or `for(std::string line; std::getline(file, line);)` which still scopes `line` correctly. – Quentin Apr 05 '18 at 16:15
  • @Zereges That exposes `line` to the enclosing scope. I have learned to avoid that. – sbi Apr 06 '18 at 12:09
  • @Quentin: Unfortunately, `bool(stream)` is not the same as `stream.good()`. (Which, in turn, is _not_ the opposite of `stream.bad()`. Streams are a mess.) – sbi Apr 06 '18 at 12:11
  • From [that insane truth table at the bottom](http://en.cppreference.com/w/cpp/io/basic_ios/operator_bool), I see that only the EOF condition changes between `operator bool` and `good()`. But we *do* want to detect EOF here, don't we? – Quentin Apr 06 '18 at 12:17
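
The difference discussed in these comments shows up on input whose last line has no trailing newline: `getline` reads that line successfully but also sets `eofbit`, so `good()` is already false. A minimal sketch of both loop shapes, with made-up function names, for comparison:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Loop shape from the answer: stop as soon as the stream is not good().
std::vector<std::string> lines_until_not_good(std::istream& in) {
    std::vector<std::string> lines;
    for (;;) {
        std::string line;
        std::getline(in, line);
        if (!in.good())
            break;  // also taken when the final line was read but hit EOF
        lines.push_back(line);
    }
    return lines;
}

// Loop shape from the comments: stop only when getline itself fails.
std::vector<std::string> lines_while_getline(std::istream& in) {
    std::vector<std::string> lines;
    for (std::string line; std::getline(in, line);)
        lines.push_back(line);
    return lines;
}
```

Feeding each function a `std::istringstream` holding `"one\ntwo"` yields one line from the first and two lines from the second; with a trailing newline both return two lines.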