I am trying to extract a snippet from the source code of a website, and I want to delete all the spaces and tabs before the tags on each line. I copy each line into a char array and check each character with isspace (I also tried comparing against '\t' and ' ') until I reach some other character such as '<', counting how many spaces and tabs there are. Then I create another char array and copy the line (separator) into it, skipping the leading whitespace (with separator[chars+k]). This works pretty well, but if a line starts with more than 5 tabs it doesn't work properly. I have absolutely no idea where the fault is.

for(int i = 0;i < lines;i++){

    getline(codefile, buf);

    char *separator = new char[buf.size()+1];
    separator[buf.size()] = 0;
    memcpy(separator,buf.c_str(),buf.size());

    int chars = 0;

    for(int j = 0; j <= sizeof(separator); j++){

        if(isspace(separator[j])){
            chars++;    
        }
        else{
            break;
        }
    }

    char *newbuf= new char[buf.size()-chars+1];
    newbuf[buf.size()-chars] = 0;

    for(int k = 0; k <= buf.size()-chars+1; k++){
        newbuf[k] = separator[chars+k];
    }

    if(i > lcounter){
        cout << newbuf << i << endl;
    }

}

Here is the snippet of the source code from the website. You can see the problem at the img tag, the closing figure tag, and the p tag: they have more than 5 tabs (sorry, I had to censor the content).

<div class="xxx">

   <article class="xxx" data-id="0">
    <a href="link" class="tile" style="background-image:url('x.jpg');background-position:left center"  data-more="&lt;a href=x" data-clicks="&lt;i class=&quot;fa fa-eye&quot;&gt;&lt;/i&gt;" data-teaserimg="x.jpg">
    <time datetime="2015">
        <span>2015</span>
    </time>
    <h1 class="title">
        <span>x</span>
    </h1>
    <div class="x">x</div>
    <div class="x">x</div>      
    <div class="x">
        <figure class="x">
            <img src="x.jpg" width="1" height="1" alt="">
        </figure>
        <p>
            <strong>x</strong>xxx
        </p>
    </div>
</a>

Sorry, I can't post a picture; I hope this is understandable.

Brian Tompsett - 汤莱恩
  • 1
    It might be time to learn how to use a debugger, and how to step through the code line by line while observing variables and their values. – Some programmer dude Aug 05 '15 at 20:21
  • 2
    There are a few points that I find suspect though, and they are all about why you use dynamic memory allocation. Not once, but *twice*. Why not simply use `std::string` as well? There are many examples of how to *trim* (that's the term) leading whitespace, for example [this old answer here](http://stackoverflow.com/questions/216823/whats-the-best-way-to-trim-stdstring/217605#217605). – Some programmer dude Aug 05 '15 at 20:25
  • As for the reason behind your troubles, I'm guessing you're on a 32-bit system where pointers are 32 bits (*four bytes*). You really need to learn more about the [`sizeof` operator](http://en.cppreference.com/w/cpp/language/sizeof). – Some programmer dude Aug 05 '15 at 20:27

1 Answer

sizeof(separator) should be strlen(separator)

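sizeof gives the size of the separator variable, not the length of the string. Since separator is a char*, this is the size of a pointer, which is four bytes on a 32-bit system. The loop condition j <= sizeof(separator) therefore stops after at most five characters, no matter how long the line is. Now do you see why your code doesn't work when you have more than five tabs?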
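
To make the difference concrete, here is a minimal sketch that compares the two loop bounds. Both function names are made up for illustration and are not part of the original code. The second one mirrors the loop from the question, so its result depends on the pointer size of the platform.

```cpp
#include <cctype>
#include <cstddef>
#include <cstring>

// Hypothetical helper: counts leading whitespace using the real string length.
std::size_t count_leading_whitespace(const char* text) {
    const std::size_t length = std::strlen(text);  // characters in the string
    std::size_t count = 0;
    while (count < length &&
           std::isspace(static_cast<unsigned char>(text[count]))) {
        ++count;
    }
    return count;
}

// Hypothetical helper: same bound as the question's loop. sizeof(text) is the
// size of the pointer, so at most sizeof(char*) + 1 characters are examined.
std::size_t count_with_sizeof_bound(const char* text) {
    std::size_t count = 0;
    for (std::size_t j = 0; j <= sizeof(text); ++j) {
        if (std::isspace(static_cast<unsigned char>(text[j]))) {
            ++count;
        } else {
            break;  // also stops at the terminating '\0'
        }
    }
    return count;
}
```

For a line that starts with twelve tabs, the first function reports 12, while the second reports sizeof(char*) + 1: that is 5 with four-byte pointers and 9 with eight-byte pointers.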

And as others have pointed out there really is no reason to copy the string to the separator array. Why not just examine the characters where they are? isspace(buf[j]) works just as well as isspace(separator[j]).
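
A minimal sketch of that idea, working on the std::string directly with no new[] at all. The function name strip_leading_whitespace is an assumption for illustration; it does not come from the question or from any library.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: returns a copy of `line` without its leading
// spaces, tabs, or other whitespace.
std::string strip_leading_whitespace(const std::string& line) {
    std::size_t first = 0;
    // line.size() is the real length, so any number of tabs is handled.
    while (first < line.size() &&
           std::isspace(static_cast<unsigned char>(line[first]))) {
        ++first;
    }
    return line.substr(first);  // empty string if the line was all whitespace
}
```

In the loop from the question, something like cout << strip_leading_whitespace(buf) << i << endl; could then stand in for both separator and newbuf, which also avoids the two allocations that are never freed.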

john