
I have a string containing an HTML page downloaded via WinHttpReadData. The string is a simple char*.
I've been trying to figure out a way to extract only the URLs that are on that page. As an example, imagine you search Google for the word WinHTTP and are presented with an HTML page full of links. I now need to find each link, extract it, and save it to a file.

I tried searching for HREF, http:// and other keywords and then extracting the string all the way to the </a>, but it's not really working. It would also be nice to get the description out of that link (for <a href="http://someurl.com/somepage.html">some text</a>, get "some text"), but that's not as important as the URL itself.

The tricky part is that I cannot use 3rd-party libraries, since I don't want to have to deal with licenses and the like.

Any ideas on how to do this? Does WinHTTP provide a way to do this in C (not C++)?

Thanks for the help

Mr Aleph
  • "since I don't want to have to deal with licenses and the like" - just find an HTML parser that is licensed under the LGPL. Then you can basically use it without caring about anything as long as you don't modify the library itself – ThiefMaster Mar 01 '11 at 14:38
  • Already tried, couldn't find one that is either BSD, MOZILLA or LGPL. Thanks tho. – Mr Aleph Mar 01 '11 at 14:49

1 Answer


Maybe you should go for the PCRE C API (available on the PCRE site).

The regex you'll need will look something like this:

<a.*?href=["'](?<url>.*?)["'].*?>(?<name>.*?)</a>

This should map the link target and the link text to the named groups <url> and <name> in the match result.
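If linking PCRE turns out to be impractical, the href part of that pattern can be approximated with plain C string scanning and no extra library. The following is only a rough sketch, not a real HTML parser: `next_href` and `starts_with_ci` are hypothetical helper names, only quoted attribute values are handled, and any text that merely looks like `href="..."` (inside a comment or a script, say) will also match.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if s begins with the lowercase literal lit, ignoring case. */
static int starts_with_ci(const char *s, const char *lit)
{
    while (*lit) {
        if (tolower((unsigned char)*s) != *lit)
            return 0;
        s++;
        lit++;
    }
    return 1;
}

/*
 * Hypothetical helper: finds the next quoted href value at or after *cursor,
 * copies it (truncated to fit) into out, and moves *cursor past the value.
 * Returns 1 when a value was found, 0 when the input is exhausted.
 */
static int next_href(const char **cursor, char *out, size_t out_size)
{
    const char *p = *cursor;

    if (out_size == 0)
        return 0;

    for (; *p; p++) {
        const char *q;
        const char *end;
        size_t len;
        char quote;

        if (!starts_with_ci(p, "href"))
            continue;

        q = p + 4;
        while (isspace((unsigned char)*q))
            q++;
        if (*q != '=')
            continue;
        q++;
        while (isspace((unsigned char)*q))
            q++;
        if (*q != '"' && *q != '\'')
            continue;               /* assumption: unquoted values are skipped */

        quote = *q++;
        end = strchr(q, quote);     /* matching closing quote */
        if (end == NULL)
            break;                  /* unterminated attribute: stop scanning */

        len = (size_t)(end - q);
        if (len >= out_size)
            len = out_size - 1;     /* truncate rather than overflow */
        memcpy(out, q, len);
        out[len] = '\0';

        *cursor = end + 1;
        return 1;
    }

    *cursor = p;
    return 0;
}
```

To use it, point a `const char *` cursor at the start of the NUL-terminated buffer filled by WinHttpReadData and call `next_href(&cursor, url, sizeof url)` in a loop until it returns 0, writing each `url` to the output file. The link text could be pulled out the same way by scanning from the cursor to the next `>` and then to the following `</a>`.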

M'vy
  • I'd also use Regex. If you are using C++0X it has built-in support for it in the STL. – RedX Mar 01 '11 at 14:39
  • Thanks for the tip. I just downloaded PCRE, but it's a mess to figure out what to use from all the files in the package. Would you mind pointing me to the files I need? Or do I need them all? – Mr Aleph Mar 01 '11 at 14:46
  • I don't want to sound rude, but the README should be a good start. I bet the sources compile with make or cmake into a library. Then you include the header file that describes the library's external interface, and you link against the library when building. Also look for documentation on the website or with Google. You'll surely find examples. Sorry I can't be more precise, but I'm not using the library at the moment. – M'vy Mar 01 '11 at 15:59
  • Thanks. Unfortunately I am not using GCC or another compiler where I can use the makefile provided with the library. I'm using Visual Studio... I guess I will not be using this then. Thanks again – Mr Aleph Mar 01 '11 at 16:05
  • In that case, also check the `NON-UNIX-USE` file. It covers using cmake on Windows, and parts of it talk about Visual Studio. BTW, you're welcome. – M'vy Mar 01 '11 at 16:09