How do I optimize string searching for c++ in a for loop?

Question

Is there anyway I can optimize this function to take less times for strings with half a million characters?

for (int i = start; i <= end; i++) {
        doc.body.push_back(html[i]);
        

        if (html[i] == '<') {
            if (html[i + 1] != '/') {
                Tag newTag;
                int start1 = html[i];
                int end1;

                for (int y = i; y <= end; y++) {
                    if (html[y] == '>') {
                        end1 = y;
                        break;
                    }
                }
                
                for (int y = start1 + 1; y <= end1; y++) {
                    if (html[y] == ' ' || html[y] == '>') {
                        break;
                    }
                    newTag.tagType.push_back(html[y]);
                }
                for (int y = start1; y <= end1; y++) {
                    newTag.openTag.push_back(html[y]);
                }
                if (newTag.tagType.find("area") != std::string::npos || newTag.tagType.find("base") != std::string::npos || newTag.tagType.find("br") != std::string::npos || newTag.tagType.find("col") != std::string::npos || newTag.tagType.find("command") != std::string::npos || newTag.tagType.find("embed") != std::string::npos || newTag.tagType.find("hr") != std::string::npos || newTag.tagType.find("img") != std::string::npos || newTag.tagType.find("input") != std::string::npos || newTag.tagType.find("keygen") != std::string::npos || newTag.tagType.find("link") != std::string::npos || newTag.tagType.find("meta") != std::string::npos || newTag.tagType.find("param") != std::string::npos || newTag.tagType.find("source") != std::string::npos || newTag.tagType.find("track") != std::string::npos || newTag.tagType.find("wbr") != std::string::npos) {
                    newTag.closeTag = "Null";
                }

                if (newTag.openTag.find("class=") != std::string::npos) {
                    int start1 = newTag.openTag.find("\"", newTag.openTag.find("class=")) + 1;
                    int end1 = newTag.openTag.find("\"", start1);

                    for (int y = start1; y < end1; y++) {
                        if (html[y] != ' ') {
                            newTag.tagClass.back().push_back(html[y]);
                        }
                        else {
                            newTag.tagClass.push_back("");
                        }
                    }
                }

                if (newTag.openTag.find("id=") != std::string::npos) {
                    int start1 = newTag.openTag.find("\"", newTag.openTag.find("id=")) + 1;
                    int end1 = newTag.openTag.find("\"", start1);

                    for (int y = start1; y < end1; y++) {
                        if (html[y] != ' ') {
                            newTag.tagID.back().push_back(html[y]);
                        }
                        else {
                            newTag.tagID.push_back("");
                        }
                    }

                }


                page.tags.push_back(newTag);
            }
            else {


                int end1;
                for (int y = i; y <= stringSize; y++) {
                    if (html[y] == '>') {
                        end1 = y;
                        break;
                    }
                }


                //gets everything within the closing tag
                std::string storeClose;
                for (int y = i; y <= end1; y++) {
                    storeClose.push_back(html[y]);
                }
                for (int y = page.tags.size() - 1; y >= 0; y--) {
                    if (storeClose.find(page.tags[y].tagType) != std::string::npos) {
                        page.tags[y].closeTag = storeClose;
                    }
                }

            }
        }
    }

I timed how long it took with the chrono library and it took 16 minutes for a string with the length of 300000 characters! This is supposed to be parsing a html document downloaded from the web and its mostly functional. For shorter pages its almost instant but as soon as I reach the higher number the time it takes its exponentially higher!

*I timed how long it took with the chrono library* -- You should specify the compiler flags you used to build the application. Are you running an optimized build, or a non-optimized "debug" build? If it is a non-optimized build, then the timings you are reporting are meaningless. — PaulMcKenzie, Feb 18 '22 at 17:47
You might want to consider using a library to handle the HTML parsing for you. It will most certainly be well-optimized, and hopefully also will have less chances of bugs, and require less maintenance from you. — Some programmer dude, Feb 18 '22 at 17:50
*but as soon as I reach the higher number the time it takes its exponentially higher!* -- Also, have you considered that the code you wrote may be totally naive for doing big string searches, and that there is little to no chance of taking your existing code and tweaking it here or there to get it to work quickly? — PaulMcKenzie, Feb 18 '22 at 17:51
First, you should really refactor it into smaller functions and replace some blocks with appropriate STL algorithms like std::find and avoid copying chars by hand, `std::string` has handy constructors. Also `i` should maybe be incremented based on what you find. — Quimby, Feb 18 '22 at 17:54
Side note: `<=` in the exit condition of a `for` loop is so often a bug that any time you see it, you should inspect it care fully to ensure the program doesn't go marching out of bounds. I lack sufficient context to tell you whether or not it's bad here, so you'll have to do the checking. Also know that when you do things that are frequently wrong, even when you do it right, will cause problems. Code review is slower because your work will be checked more diligently (often not bad thing), and other folk working with your code may try to "fix it" with oft-comical results. — user4581301, Feb 18 '22 at 17:54
I agree with @Someprogrammerdude -- find a library to do this work. It isn't as trivial as you may believe. You will either miss edge cases, or have bugs. By the time you figure all of that out, you've spent weeks, maybe months, on getting something to work that has already been written, debugged, and optimized by another library. — PaulMcKenzie, Feb 18 '22 at 18:00
You might consider storing elements of the HTML document as `std::string_view`s rather than `std::string` copies. — Galik, Feb 18 '22 at 18:34
@PaulMcKenzie i am compiling it using visual studio code 17 debug mode. Also finding another library completely gets rid of the purpose of me doing this because I want to make my own html parser for the experience. — Joshua Rodriguez, Feb 18 '22 at 19:27
I just checked running the program with the release version and it was reduced greatly to 13 seconds but even then its too long — Joshua Rodriguez, Feb 18 '22 at 19:31
@JoshuaRodriguez *Also finding another library completely gets rid of the purpose* -- No it doesn't do this. What it does do is it gives you an idea of how the experienced programmer accomplishes what you're trying to accomplish. Just because you want to do it yourself doesn't mean you can't look to see how others did it. The source code is available for existing C++ parsers, so take a look to see how others have done it. That is what any good programmer would do. If you find that a C++ library parses the same data in 2 seconds, wouldn't you be curious to see how they did it? — PaulMcKenzie, Feb 18 '22 at 19:48
@PaulMcKenzie my bad I thought what you meant to say was to just stop doing it and just use a pre existing library — Joshua Rodriguez, Feb 18 '22 at 19:53

score 0 · Answer 1 · answered Feb 19 '22 at 12:51

I answered already yesterday the same question.

But first. Please compile your source code with all warnings enabled. There are tons of compiler warnings that you must resolve.

Additionally: In my opinion you have a hard bug in line int start1 = html[i];. I guess this should be int start1 = i;

Then, because you did not create a minimum reproducable example, I stubbed your code to test it.

Then I copied the HTML source from this file from stackoverflow and copied it in a local file 4 times to get a file size of 1.9MByte.

Then I let it run on my machine in below 4 seconds (including reading the 2MByte) from disk.

I used the profiler to check, where the time is burned. Several rounds.

The timed is mainly burned in string.find, string assignment and vector assignments.

The reason is not that the find-function is slow, but your excessive and unnecessary use of it.

Here you must work.

Look at the lines:

if (newTag.tagType.find("area") != std::string::npos || newTag.tagType.find("base") != std::string::npos || newTag.tagType.find("br") != std::string::npos || newTag.tagType.find("col") != std::string::npos || newTag.tagType.find("command") != std::string::npos || newTag.tagType.find("embed") != std::string::npos || newTag.tagType.find("hr") != std::string::npos || newTag.tagType.find("img") != std::string::npos || newTag.tagType.find("input") != std::string::npos || newTag.tagType.find("keygen") != std::string::npos || newTag.tagType.find("link") != std::string::npos || newTag.tagType.find("meta") != std::string::npos || newTag.tagType.find("param") != std::string::npos || newTag.tagType.find("source") != std::string::npos || newTag.tagType.find("track") != std::string::npos || newTag.tagType.find("wbr") != std::string::npos) {
                    newTag.closeTag = "Null";
                }

That is of course very inefficient.

Anyway. Study you code and refactor it. The design is important.

Please see your stubbed code:

#include <iostream>
#include <fstream>
#include <string>
#include <vector>

struct Doc {
    std::string body{};
};
struct Tag {
    std::vector<std::string> tagID{};
    std::string tagType{};
    std::string openTag{};
    std::string closeTag{};
    std::vector<std::string> tagClass{};
};
struct Page {
    std::vector<Tag> tags;
};

int main() {

    Doc doc{};
    Page page{};

    std::vector<char> html{};
    html.reserve(200000000);

    if (std::ifstream ifs("r:\\example.htm"); ifs) {
        
        std::copy(std::istreambuf_iterator<char>(ifs), {}, std::back_inserter(html));

        int start = 0;
        int end = static_cast<int>(html.size());

        for (int i = start; i <= end; i++) {
            doc.body.push_back(html[i]);


            if (html[i] == '<') {
                if (html[i + 1] != '/') {
                    Tag newTag;
                    int start1 = i;//html[i];
                    int end1;

                    for (int y = i; y <= end; y++) {
                        if (html[y] == '>') {
                            end1 = y;
                            break;
                        }
                    }

                    for (int y = start1 + 1; y <= end1; y++) {
                        if (html[y] == ' ' || html[y] == '>') {
                            break;
                        }
                        newTag.tagType.push_back(html[y]);
                    }
                    for (int y = start1; y <= end1; y++) {
                        newTag.openTag.push_back(html[y]);
                    }
                    if (newTag.tagType.find("area") != std::string::npos || newTag.tagType.find("base") != std::string::npos || newTag.tagType.find("br") != std::string::npos || newTag.tagType.find("col") != std::string::npos || newTag.tagType.find("command") != std::string::npos || newTag.tagType.find("embed") != std::string::npos || newTag.tagType.find("hr") != std::string::npos || newTag.tagType.find("img") != std::string::npos || newTag.tagType.find("input") != std::string::npos || newTag.tagType.find("keygen") != std::string::npos || newTag.tagType.find("link") != std::string::npos || newTag.tagType.find("meta") != std::string::npos || newTag.tagType.find("param") != std::string::npos || newTag.tagType.find("source") != std::string::npos || newTag.tagType.find("track") != std::string::npos || newTag.tagType.find("wbr") != std::string::npos) {
                        newTag.closeTag = "Null";
                    }
                    if (newTag.openTag.find("class=") != std::string::npos) {
                        int start1 = newTag.openTag.find("\"", newTag.openTag.find("class=")) + 1;
                        int end1 = newTag.openTag.find("\"", start1);

                        for (int y = start1; y < end1; y++) {
                            if (newTag.tagClass.empty()) newTag.tagClass.push_back({});
                            if (html[y] != ' ') {
                                if (newTag.tagClass.empty()) newTag.tagClass.push_back({});   // ***********************
                                newTag.tagClass.back().push_back(html[y]);
                            }
                            else {
                                newTag.tagClass.push_back("");
                            }
                        }
                    }
                    if (newTag.openTag.find("id=") != std::string::npos) {
                        int start1 = newTag.openTag.find("\"", newTag.openTag.find("id=")) + 1;
                        int end1 = newTag.openTag.find("\"", start1);
                        if (newTag.tagID.empty()) newTag.tagID.push_back({});
                        for (int y = start1; y < end1; y++) {
                            if (html[y] != ' ') {
                                newTag.tagID.back().push_back(html[y]);
                            }
                            else {
                                newTag.tagID.push_back("");
                            }
                        }

                    }
                    page.tags.push_back(newTag);
                }
                else {


                    int end1;
                    for (int y = i; y <= html.size(); y++) {
                        if (html[y] == '>') {
                            end1 = y;
                            break;
                        }
                    }


                    //gets everything within the closing tag
                    std::string storeClose;
                    for (int y = i; y <= end1; y++) {
                        storeClose.push_back(html[y]);
                    }
                    for (int y = page.tags.size() - 1; y >= 0; y--) {
                        if (storeClose.find(page.tags[y].tagType) != std::string::npos) {
                            page.tags[y].closeTag = storeClose;
                        }
                    }
                }
            }
        }
    }
}

And one output of the profile runs:

The test has been performed on a 12 years old Windows 7 machine.

Thank you for the suggestions I am looking into what you have told me thank you! — Joshua Rodriguez, Feb 20 '22 at 05:51
Also yes I realized the line int start1 = html[i] was a bug and not intentional I had it fixed after copying the code — Joshua Rodriguez, Feb 20 '22 at 05:52

How do I optimize string searching for c++ in a for loop?

1 Answers1