
I am writing an HTML parser and it is coming along well. I can extract tags along with their classes and IDs, and it is also simple to get all other attributes.

The issue is that it is rather slow, and I am struggling to make it faster. I have tried removing code that isn't strictly necessary and adding more if statements so that other code is checked less often. I did some research and read that find() in C++ is rather slow for larger strings. I tested the parser on websites such as example.com: parsing that page takes 3 seconds, which is slow but bearable. More complex sites take about 8 minutes, which is ridiculous. This is the first time I have done something like this.

Is there a way to find a substring within a string much faster than using .find()?

I know there is more I can do, such as reducing the number of allocations, which I plan to look into. Any other suggestions would be greatly appreciated!

Example

std::string test = "A string that has half a million characters!";
std::cout << test.find("half") << std::endl;
  • Does this answer your question? [Is there alternative str.find in c++?](https://stackoverflow.com/questions/56266557/is-there-alternative-str-find-in-c) –  Feb 18 '22 at 00:39
  • What are you using find `for` in the parser? It's a pretty simple and blunt instrument, not that well suited to seeking keys in a complex HTML document. [Have you considered using regex?](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – user4581301 Feb 18 '22 at 00:51
  • @OrkhanAliyev, no, it doesn't. – Pete Becker Feb 18 '22 at 00:54
  • @user4581301 the for loops are used to iterate through the characters within the string, add them to the innerHTML of the tag, and check whether a new tag starts within the current tag – FlimsyEar665012 Feb 18 '22 at 01:04
  • @PeteBecker If so, I don't think that standard libraries will help you without a strong algorithm. –  Feb 18 '22 at 01:11
  • @FlimsyEar665012 It is hard to help you find something faster, when we can't see what you are already using. For all we know, you are simply not using `find()` effectively/correctly to begin with. Can you show some of your parsing code? In any case, the first thing I would suggest is investigate using [`std::string_view`](https://en.cppreference.com/w/cpp/string/basic_string_view), which will allow you to create and search substrings without allocating any new memory. That way, you end up with a bunch of tokens and innertexts and what-not that are just pointers into the original `std::string` – Remy Lebeau Feb 18 '22 at 01:29

1 Answer


The problem is not with std::string::find() but elsewhere in your code.

For reference, I ran the test below on a string of 100 million characters:

#include <iostream>
#include <string>
#include <algorithm>
#include <chrono>

constexpr size_t StringSize = 100'000'000u;

int main() {

    std::string longString(StringSize, ' ');
    std::string stringToSearch = "abcdefghijklmnopq";
    std::copy(stringToSearch.begin(), stringToSearch.end(), longString.end() - stringToSearch.length() - 1);

    // Start of time measurement; steady_clock is monotonic, so it suits measuring intervals
    auto startTime = std::chrono::steady_clock::now();

    size_t pos = longString.find(stringToSearch, 0u);
    std::cout << pos << '\n';

    // End of time measurement
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - startTime);
    std::cout << "\nSearch duration: " << elapsed.count() << " ms\n";
}

This runs in about 10 ms on my machine.

The problem is in the design of your solution, that is, in the code that you have not shown.

A M