
My question is this: I have the HTML body of a website inside a std::string, and now I want to extract all the URLs it contains into a std::vector< std::string >. I know how to use regex to check whether a string is a URL, but I don't know how to extract all the URLs into a std::vector.

Can somebody point me in the right direction?

Marvin Klar
jdoe1010
  • 2
    You do not want to use regex on HTML. [The Pony will get you.](https://stackoverflow.com/a/1732454/4581301) – user4581301 Mar 31 '18 at 23:30
  • @Neil What are you talking about? –  Mar 31 '18 at 23:34
  • @user4581301 So, what do you suggest otherwise? – jdoe1010 Mar 31 '18 at 23:37
  • 2
    Have you tried googling "c++ html parsing" – quant Mar 31 '18 at 23:43
  • 1
    You might be able to write a regex that extracts the href value from simple `a` tag links, but think about all the special cases you'd have to handle -- imperfect HTML, Javascript links, `a` tags nested inside `pre` elements, relative paths... the list goes on. Better to let a good HTML parser do the dirty work for you. – MrEricSir Apr 01 '18 at 00:29

2 Answers

2

To meaningfully extract data from an HTML document, you need to parse the HTML. The HTML specification describes the syntax of HTML (note that there are older versions of HTML as well, so be sure to parse according to the version in which your HTML document was written). The specification has a very useful section titled Parsing HTML documents, which is directly relevant to writing a parser.

The result of parsing an HTML document should be a Document Object Model tree. You can traverse this tree to find the URLs that you're looking for.
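As a rough illustration of the traversal step, here is a minimal sketch. The `Node` struct and the `collect_links` function are hypothetical stand-ins invented for this example; a real HTML parser would hand you its own node type, and the walk would use that library's accessors instead. It also assumes that only the `href` attribute of `a` elements is of interest.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical, minimal stand-in for a DOM node. A real HTML parser
// provides its own node type; this one exists only for illustration.
struct Node {
    std::string tag;                                // element name, e.g. "a"
    std::map<std::string, std::string> attributes;  // attribute name -> value
    std::vector<Node> children;                     // child elements
};

// Depth-first walk that appends the href value of every <a> element
// found in the tree rooted at `node` to `out`.
void collect_links(const Node& node, std::vector<std::string>& out) {
    if (node.tag == "a") {
        auto it = node.attributes.find("href");
        if (it != node.attributes.end()) {
            out.push_back(it->second);
        }
    }
    for (const Node& child : node.children) {
        collect_links(child, out);
    }
}
```

Calling `collect_links(root, urls)` on the root of the parsed tree fills `urls` in document order. The values are stored exactly as written in the attribute, so relative paths would still need to be resolved against the page's base URL.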

eerorika
1

Using a good markup language reader, such as Boost Property Tree, would always be advisable over trying to process the markup by hand.

But hypothetically, let's say that you had developed a bulletproof regex for parsing. Because we don't want jealousy to arise from the other victims who've tried to cross the treacherous minefield of markup language processing via regex, we'll just call your regex `regex re`, and we'll say that its 1st capture is the URL that you want to store in this vector.

With such a legendary regex, the only other thing you'll need is a regex_token_iterator. Given that the input to process is const string text, you could simply do this:

    vector<string> foo { sregex_token_iterator(cbegin(text), cend(text), re, 1), sregex_token_iterator() };
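
For completeness, here is the same idea as a self-contained sketch. The function name `extract_urls` and the pattern are assumptions made up for this example: the pattern only captures the value of a double-quoted `href` attribute and is deliberately naive, so it is nowhere near the bulletproof regex imagined above.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the first capture group of every match
// of a deliberately naive href pattern. This is not a robust HTML
// parser; it only recognizes double-quoted href attribute values.
std::vector<std::string> extract_urls(const std::string& text) {
    // The regex must outlive the iterators, so it is kept in a static.
    static const std::regex re(R"rx(href\s*=\s*"([^"]*)")rx",
                               std::regex::icase);

    // Submatch index 1 selects the first capture group of each match.
    std::sregex_token_iterator first(text.cbegin(), text.cend(), re, 1);
    std::sregex_token_iterator last;  // default-constructed end iterator

    return std::vector<std::string>(first, last);
}
```

Note that `std::regex_token_iterator` refuses to bind to a temporary `std::regex` (that overload is deleted), which is why the pattern is stored in a named object before the iterators are built.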
Jonathan Mee