How to have c++ webscraper scrape through html until it hits a float/int

Question

I am trying to use a C++ scraper in my UI to sift through WSJ stock information and pull back some balance sheet info. Right now it searches the page source for specific text, e.g. "P/E Ratio", and I manually counted how many chars sit between that text and the actual number on the website.

Here is the code:

 // P/E Ratio
    size_t indexpeRatio = html.find("P/E Ratio ") + 116;
    string s_peRatio = html.substr(indexpeRatio, 5);
    peRatio = stod(s_peRatio);

After that, it simply stores the number and I output it to my UI. My issue is that the number of characters in between sometimes changes depending on which company I choose to evaluate. I am wondering if there is a way to use the .find() function to find "P/E Ratio" and then read the next float/int that follows it.

Here is what the HTML looks like on the site

As of right now, my UI will sometimes output parts of the HTML because I have to use a fixed number of chars

This is an example of my UI output when giving it a smaller company to evaluate

Do you all have any recommendations I can use to fix this issue? Thank you in advance!

Please don't post images of text. Copy paste code and input/output — bolov, Apr 23 '20 at 00:36
use regex to match the info you need. [Oh no wait!](https://stackoverflow.com/a/1732454/2805305) Oh, nevermind. — bolov, Apr 23 '20 at 00:38

Evan Hendler · Accepted Answer · 2020-04-23T01:45:30.643

You could iterate through the characters.

Say you have string html:

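#include <cctype>
#include <string>
using namespace std;

// Declared before main() so the call below compiles.
double find_ratios(const string& html);

int main(){
     double peratio=0.0;
     string html;
     /*

     This is where you do your HTML scraping logic

     */
     size_t indexpeRatio=html.find("P/E Ratio");
     // find() returns string::npos when the label is not in the page
     if(indexpeRatio!=string::npos)
         peratio=find_ratios(html.substr(indexpeRatio)); // substr(pos) runs to the end of the string
}

// Returns the first run of digits (including any '.' inside it) as a double.
// stod throws std::invalid_argument if the string contains no digits at all.
double find_ratios(const string& html){
    size_t i=0;
    std::string output;
    bool wasInt=false,isInt=false;
    // html[html.size()] is '\0', so the loop also stops at the end of the string
    while(html[i]!='\0'&&!wasInt){
        if(isdigit(static_cast<unsigned char>(html[i])))
            isInt=true;
        if(isInt){
            if(html[i]!='.'&&!isdigit(static_cast<unsigned char>(html[i]))){
                wasInt=true; // first non-numeric char after the number: stop
                isInt=false;
            }
            else output+=html[i];
        }
        i++;
    }
    return stod(output);
}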
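
One caveat with stopping at the first digit: the markup between the label and the value can itself contain digits (inside a class name or attribute, for example), and those would be picked up first. Below is a minimal sketch of a variant that ignores anything inside `<...>` tags before looking for a number. The helper name `first_number_after` is made up for illustration, and it swaps the hand-written digit loop for `std::strtod`; I have not checked it against the actual WSJ markup.

```cpp
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical helper (not from the original answer): finds `label` in `html`,
// then stores the first number that appears in the visible text after it in `out`.
// Characters inside <...> tags are skipped, so digits in attributes are ignored.
// Returns false if the label is missing or no number follows it.
bool first_number_after(const std::string& html,
                        const std::string& label,
                        double& out) {
    std::size_t pos = html.find(label);
    if (pos == std::string::npos) return false;
    pos += label.size();  // start scanning right after the label

    bool inTag = false;
    for (; pos < html.size(); ++pos) {
        char c = html[pos];
        if (c == '<') { inTag = true; continue; }
        if (c == '>') { inTag = false; continue; }
        if (inTag) continue;
        if (std::isdigit(static_cast<unsigned char>(c))) {
            // strtod reads as many numeric characters as it can and stops
            // at the first one that does not belong to the number.
            out = std::strtod(html.c_str() + pos, nullptr);
            return true;
        }
    }
    return false;
}
```

A call would look like `double pe; if (first_number_after(html, "P/E Ratio", pe)) { /* use pe */ }`. This sketch assumes the value is a plain non-negative decimal: a leading minus sign is ignored and a thousands separator such as `1,234.5` would be cut off at the comma, so those cases need extra handling.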

edited Apr 23 '20 at 01:45

answered Apr 23 '20 at 00:57


Okay so in the while loop you do it while wasInt is true and html is not = to '\0', two questions, what does the \0 represent? And where are you getting [i] in the array, I know ive used this in the for loop but where did you define [i] at? – Lane Floyd Apr 23 '20 at 01:03
Sorry, forgot `i`. `'\0'` is null. Strings in C++ are null terminated, so I put that in to allow the loop to terminate if it hits the end of the string. Technically, you could get through the entire HTML string without having to use `.find()` at all. – Evan Hendler Apr 23 '20 at 01:05
Okay i appreciate the help! But in your code you are starting at the very top of the HTML document correct? How could I go about starting it directly after "Pe Ratio" in the code, sorry i am very new to all of this? – Lane Floyd Apr 23 '20 at 01:10
That's fine! I was very new at one point as well. I'll see if I can't give you something more complete. – Evan Hendler Apr 23 '20 at 01:13
Yep - again, this is not all inclusive, and I haven't compiled it. Just free balling here. Let me know if you have any issues. – Evan Hendler Apr 23 '20 at 01:23
The only issue I have with it right now is it is saying invalid operands to binary expression when using html!='\0'. Still trying to work on fixing it. – Lane Floyd Apr 23 '20 at 01:44
Sorry, my bad. Should be `html[i]`. `'\0'` is a char, and `html` is a string, i.e. an array of chars. `html[i]` will access that specific char for a char to char comparison – Evan Hendler Apr 23 '20 at 01:45
Ahh that actually makes perfect sense, thats why its invalid operand since html is the full string and the '\0' is meant for a null char, thank you again! – Lane Floyd Apr 23 '20 at 01:50
