A few days ago I started learning C++. I'm writing a simple XHTML parser that does not handle nested tags. For testing I have been using the following data: http://pastebin.com/bbhJHBdQ (around 10k chars). I only need to parse the data between p, h2 and h3 tags. My goal is to parse each tag and its content into the following structure:
struct Node {
short tag; // p = 1, h2 = 2, h3 = 3
std::string data;
};
For example, <p> asdasd </p> will be parsed to tag = 1, data = "asdasd". I don't want to use third-party libraries, and I'm trying to optimize for speed.
Here is my code:
short tagDetect(char * ptr){
if (*ptr == '/') {
return 0;
}
if (*ptr == 'p') {
return 1;
}
if (*(ptr + 1) == '2')
return 2;
if (*(ptr + 1) == '3')
return 3;
return -1;
}
struct Node {
short tag;
std::string data;
Node(std::string input, short tagId) {
tag = tagId;
data = input;
}
};
int _tmain(int argc, _TCHAR* argv[])
{
std::string input = GetData(); // returns the pastebin content above
std::vector<Node> elems;
std::string::size_type pos = 0;
char pattern = '<';
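int openPos = 0; // position just after the '<' of the last opening tag
short tagID, lastTag = -1; // -1 until an opening tag has been seen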
double duration;
clock_t start = clock();
for (int i = 0; i < 20000; i++) {
elems.clear();
pos = 0;
while ((pos = input.find(pattern, pos)) != std::string::npos) {
pos++;
tagID = tagDetect(&input[pos]);
switch (tagID) {
case 0:
if (tagDetect(&input[pos + 1]) == lastTag && pos - openPos > 10) { // closing tag matches the last opening tag
elems.push_back(Node(input.substr(openPos + (lastTag > 1 ? 3 : 2), pos - openPos - (lastTag > 1 ? 3 : 2) - 1), lastTag));
}
break;
case 1:
case 2:
case 3:
openPos = pos;
lastTag = tagID;
break;
}
}
}
duration = (double)(clock() - start) / CLOCKS_PER_SEC;
printf("%2.1f seconds\n", duration);
}
The parsing runs inside a loop of 20,000 iterations so that I can measure its performance. My data contains about 10k chars.
I have noticed that the biggest bottleneck in my code is the substr call. As presented above, the code finishes executing in 5.8 seconds. If I reduce the substr length to 10, the execution time drops to 0.4 seconds. If I replace the whole substr call with "", my code finishes in 0.1 seconds.
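To make the copying concrete: as far as I can tell, each push_back above creates the substr temporary, hands it to the Node constructor, and then copies it again in data = input. Below is a rough sketch of a variant I could try, assuming a C++11 compiler. NodeSketch and appendNode are names I made up for illustration, and I have not measured whether this actually helps.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Same fields as Node above; renamed so it does not clash with it.
struct NodeSketch {
    short tag;
    std::string data;

    // The member initializer list plus std::move transfers the string
    // buffer into data instead of default-constructing data and then
    // copy-assigning to it in the constructor body.
    NodeSketch(std::string input, short tagId)
        : tag(tagId), data(std::move(input)) {}
};

// Hypothetical helper: stores input[begin, begin + len) as a new node.
void appendNode(std::vector<NodeSketch>& elems, const std::string& input,
                std::size_t begin, std::size_t len, short tagId) {
    // emplace_back constructs the node inside the vector; the temporary
    // returned by substr is moved through the constructor, not copied.
    elems.emplace_back(input.substr(begin, len), tagId);
}
```

With this, the call in case 0 would become something like appendNode(elems, input, start, length, lastTag), where start and length are the same two expressions currently passed to substr. Calling elems.reserve once before the benchmark loop might also avoid reallocations in the first iteration.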
My questions are:
- How can I optimize the substr call, since it is the main bottleneck in my code?
- Are there any other optimizations I can make to my code?
I'm not sure if this question is appropriate for SO, but I'm pretty new to C++ and I have no idea whom else to ask whether my code is complete crap.
Full source code can be found here: http://pastebin.com/dhR5afuE