How to simply parse HTML references

Question

How can I simply parse HTML links? For example, I receive an HTTP response containing HTML, which has links to other files that need to be downloaded, such as JPGs, CSS files and JS files. What is the simplest way to parse all these references?

If you need it in c++, then tag it c++ the next time... oh and you should **totally** try regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Ivo Wetzel, Jan 06 '11 at 14:49
@ivo, you suggest regex and point to the bane of parsing html with regex .. *hmmm..*, are you missing a **not** in there ? — Gabriele Petrioli, Jan 06 '11 at 14:51
@Ivo - and those who _don't_ follow the link? Do you think _they_ will get the sarcasm? — Oded, Jan 06 '11 at 14:54
@Ivo, wasn't sure but truth is that new members might not get it ... (*you did not have a single smiley .. :p*) — Gabriele Petrioli, Jan 06 '11 at 15:50

score 1 · Answer 1 · edited May 23 '17 at 12:26

Use an HTML parser for your platform/language.

There are some recommendations for c++ ones here.

Once you have a parsed document, you will need to look at each src and href attribute in it. You will also need to remember the base tag, if one exists (relative references resolve against its href rather than against the page URL), and add logic for external, relative and absolute paths.
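As a rough illustration of those steps, here is a minimal sketch in Python using only the standard library (html.parser and urllib.parse). Python is used purely for illustration; the same steps apply with whichever C++ parser you pick. The names ReferenceCollector and extract_references, and the example URLs, are made up for this example. It collects every src and href, so it also picks up ordinary a href links, which you may want to filter out if you only need downloadable resources.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ReferenceCollector(HTMLParser):
    """Collects raw src/href values and remembers the first <base href>."""

    def __init__(self, page_url):
        super().__init__()
        self.base_url = page_url  # fallback when the document has no <base>
        self.base_seen = False
        self.raw_refs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base":
            # Only the first <base> with an href is honoured.
            if not self.base_seen and attrs.get("href"):
                self.base_url = urljoin(self.base_url, attrs["href"])
                self.base_seen = True
            return
        for name in ("src", "href"):
            value = attrs.get(name)
            if value:  # skip missing or empty attributes
                self.raw_refs.append(value)


def extract_references(html_text, page_url):
    """Return absolute URLs for every src/href found in html_text."""
    parser = ReferenceCollector(page_url)
    parser.feed(html_text)
    parser.close()
    # urljoin handles absolute URLs, root-relative and relative paths alike.
    return [urljoin(parser.base_url, ref) for ref in parser.raw_refs]


if __name__ == "__main__":
    sample = (
        '<html><head><base href="http://example.com/static/">'
        '<link rel="stylesheet" href="css/site.css">'
        '<script src="/js/app.js"></script></head>'
        '<body><img src="http://cdn.example.org/logo.jpg"></body></html>'
    )
    for url in extract_references(sample, "http://example.com/pages/index.html"):
        print(url)
```

References are resolved only after the whole document has been fed in, so a base tag that appears after some references is still applied to them.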

answered Jan 06 '11 at 14:56

Oded
