1

I am working on a simple client-server project. The client is written in Java; it sends keywords to a C++ server running under Linux and receives a list of URLs with the best ranks (depending on the number of occurrences of the keywords). The server's job is to go through some URLs in search of the keywords and return the best-fitting ones. The problem is that I have to parse HTML sites to find occurrences of the keywords, and I also need to extract links from each visited page in order to search those as well.

My question is: what library can I use to do that? Remember, only C++ Linux libraries are suitable for me. There were some similar topics, so I tried to go through most of them, but some of the libraries parse only HTML files, and I don't want to download every site I visit; I want to parse it on the fly and just store its rank and URL. Some of them also look a bit complicated to me, for instance first parsing the HTML into XML or something else and only then working on the result in C++. Is there something simple yet sufficient to do what I need? Any advice will be appreciated.

koleS

3 Answers

1

I don't think regular expressions are appropriate for HTML parsing. I'm using libxml2 and I enjoy it very much: it's easy to use, portable, and lightning fast.
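libxml2 also has an HTML parser module that tolerates real-world markup, so you can parse a page straight from the buffer you fetched, without saving anything to disk. Here is a minimal sketch; the page content, keyword, and URLs are made up for illustration:

    #include <libxml/HTMLparser.h>
    #include <cstring>
    #include <iostream>
    #include <string>
    #include <vector>

    // Walk the DOM recursively: collect <a href="..."> values and count
    // occurrences of a keyword in text nodes.
    static void walk(xmlNode *node, const std::string &keyword,
                     int &hits, std::vector<std::string> &links)
    {
        for (xmlNode *cur = node; cur; cur = cur->next) {
            if (cur->type == XML_ELEMENT_NODE &&
                xmlStrcasecmp(cur->name, BAD_CAST "a") == 0) {
                if (xmlChar *href = xmlGetProp(cur, BAD_CAST "href")) {
                    links.emplace_back(reinterpret_cast<char *>(href));
                    xmlFree(href);
                }
            } else if (cur->type == XML_TEXT_NODE && cur->content) {
                const std::string text(reinterpret_cast<char *>(cur->content));
                for (size_t pos = text.find(keyword); pos != std::string::npos;
                     pos = text.find(keyword, pos + keyword.size()))
                    ++hits;
            }
            walk(cur->children, keyword, hits, links);
        }
    }

    int main()
    {
        // Placeholder content; in your server this buffer would come
        // straight from the network, so nothing touches the disk.
        const char *html = "<html><body><p>foo bar foo</p>"
                           "<a href='http://example.com/next'>foo</a></body></html>";

        htmlDocPtr doc = htmlReadMemory(html, static_cast<int>(std::strlen(html)),
                                        "http://example.com/", nullptr,
                                        HTML_PARSE_RECOVER | HTML_PARSE_NOERROR |
                                        HTML_PARSE_NOWARNING);
        if (!doc)
            return 1;

        int hits = 0;
        std::vector<std::string> links;
        walk(xmlDocGetRootElement(doc), "foo", hits, links);

        std::cout << "occurrences: " << hits << '\n';
        for (const std::string &l : links)
            std::cout << "link: " << l << '\n';

        xmlFreeDoc(doc);
        xmlCleanupParser();
    }

Build it with g++ -std=c++11 example.cpp $(xml2-config --cflags --libs).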

Violet Giraffe
  • 1. There are a lot of implementations of this library on their site. Could you please give me a link to the one that is for C++? I am not really familiar with stuff like that. 2. You said you were using it; do you maybe have a snippet of code that I could use as an example? (especially if it is used with C++) Thanks! – koleS Nov 24 '11 at 20:46
  • @koleS: Frankly, I'm new to it myself. I downloaded the snapshot: [link](ftp://xmlsoft.org/libxml2/libxml2-git-snapshot.tar.gz). I had no problem building it on Windows, and I think building for other platforms is no different. – Violet Giraffe Nov 24 '11 at 21:03
0

To fetch pages from the web using C/C++ you could use the libcurl library. To extract URLs and other reasonably simple patterns from the page you can use a regex library.
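For example, a minimal libcurl sketch that pulls a page into a std::string so it can be parsed in memory afterwards (the URL is a placeholder):

    #include <curl/curl.h>
    #include <iostream>
    #include <string>

    // Write callback: libcurl hands us the page in chunks; we append
    // them to a std::string instead of writing a file.
    static size_t onData(char *ptr, size_t size, size_t nmemb, void *userdata)
    {
        static_cast<std::string *>(userdata)->append(ptr, size * nmemb);
        return size * nmemb;
    }

    int main()
    {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL *curl = curl_easy_init();
        if (!curl)
            return 1;

        std::string body;
        curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/"); // placeholder
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, onData);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
        curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);

        CURLcode res = curl_easy_perform(curl);
        if (res != CURLE_OK)
            std::cerr << "curl: " << curl_easy_strerror(res) << '\n';
        else
            std::cout << "fetched " << body.size() << " bytes\n";

        curl_easy_cleanup(curl);
        curl_global_cleanup();
    }

Link with -lcurl.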

Separating the HTML tags from the real content can also be done without a library, as sketched below.
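As a rough illustration of the no-library route, here is a naive sketch that simply drops everything between '<' and '>':

    #include <string>

    // Naive tag stripper: drop everything between '<' and '>'. It ignores
    // comments, <script>/<style> bodies, and malformed markup, which is
    // exactly where a real parser earns its keep.
    std::string stripTags(const std::string &html)
    {
        std::string text;
        bool inTag = false;
        for (char c : html) {
            if (c == '<')
                inTag = true;
            else if (c == '>')
                inTag = false;
            else if (!inTag)
                text += c;
        }
        return text;
    }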

For more advanced stuff one could use Qt, which offers classes such as QWebPage (which uses WebKit) that allow one to access the DOM model of the page and extract individual HTML objects (e.g. single cells of a table) rather easily.
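For instance, a minimal QtWebKit sketch (Qt 4, with QT += webkit in the .pro file; the markup is a placeholder) that pulls all anchor hrefs out of a page:

    #include <QApplication>
    #include <QWebElement>
    #include <QWebFrame>
    #include <QWebPage>
    #include <QDebug>

    int main(int argc, char *argv[])
    {
        QApplication app(argc, argv); // QtWebKit needs a QApplication

        QWebPage page;
        // Placeholder markup; static HTML is loaded immediately by setHtml().
        page.mainFrame()->setHtml("<html><body>"
                                  "<a href='http://example.com/a'>one</a>"
                                  "<a href='http://example.com/b'>two</a>"
                                  "</body></html>");

        // CSS-style selector query over the DOM that WebKit built.
        foreach (const QWebElement &a, page.mainFrame()->findAllElements("a"))
            qDebug() << a.attribute("href");

        return 0;
    }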

trenki
-1

You can try xerces-c. It's a powerful library for XML parsing; it supports reading XML on the fly, and both DOM and SAX parsing.
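One caveat: Xerces-C++ is a strict XML parser, so it will only cope with well-formed XHTML, not with the tag soup found on most real pages. For completeness, a small SAX2 sketch that streams an in-memory buffer (the buffer content is a placeholder):

    #include <xercesc/framework/MemBufInputSource.hpp>
    #include <xercesc/sax2/DefaultHandler.hpp>
    #include <xercesc/sax2/SAX2XMLReader.hpp>
    #include <xercesc/sax2/XMLReaderFactory.hpp>
    #include <xercesc/util/PlatformUtils.hpp>
    #include <xercesc/util/XMLString.hpp>
    #include <cstring>
    #include <iostream>

    using namespace xercesc;

    // SAX handler: elements are reported as the parser streams through
    // the input, so no DOM is built.
    class EchoHandler : public DefaultHandler {
    public:
        void startElement(const XMLCh *, const XMLCh *localname,
                          const XMLCh *, const Attributes &)
        {
            char *name = XMLString::transcode(localname);
            std::cout << "element: " << name << '\n';
            XMLString::release(&name);
        }
    };

    int main()
    {
        XMLPlatformUtils::Initialize();
        {
            // Placeholder buffer; note it must be well-formed XML/XHTML.
            const char *xml =
                "<page><a href='http://example.com/'>foo</a></page>";
            MemBufInputSource src(reinterpret_cast<const XMLByte *>(xml),
                                  std::strlen(xml), "in-memory");

            SAX2XMLReader *parser = XMLReaderFactory::createXMLReader();
            EchoHandler handler;
            parser->setContentHandler(&handler);
            parser->parse(src);
            delete parser;
        }
        XMLPlatformUtils::Terminate();
    }

Link with -lxerces-c.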

Alessandro Pezzato