How would I go about extracting the "href" attribute of every "a" tag on a page full of BAD HTML, in Qt?

y2k
  • 1
    Can you be more specific about what is bad about the HTML? Is it bad regularly, or is it complete garbage? You can't fix what's producing the HTML? – Bill Feb 01 '10 at 19:37
  • 2
    don't use regex... http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Malfist Feb 01 '10 at 19:41
  • 1
    It's a google search, Google's HTML is terrible. Errors found while checking this document as HTML5! Result: 50 Errors, 16 warning(s) – y2k Feb 01 '10 at 19:41
  • @JOSHUA: and those errors prevent Qt from parsing the HTML using QtWebKit? – Bill Feb 01 '10 at 20:19
  • 1
    I don't know how to use QtWebKit to do this and the only answer showing it doesn't work... I think the page has to load or something? – y2k Feb 01 '10 at 21:10

2 Answers

I would use the built-in QtWebKit module. I don't know how it performs, but since it is a full browser engine I think it should cope with all "bad" HTML. Something like:

#include <QObject>
#include <QUrl>
#include <QWebElement>
#include <QWebFrame>
#include <QWebView>

class MyPageLoader : public QObject
{
  Q_OBJECT

public:
  MyPageLoader();
  void loadPage(const QUrl&);

public slots:
  void replyFinished(bool);

private:
  QWebView* m_view;
};

MyPageLoader::MyPageLoader()
{
  m_view = new QWebView();

  connect(m_view, SIGNAL(loadFinished(bool)),
          this, SLOT(replyFinished(bool)));
}

void MyPageLoader::loadPage(const QUrl& url)
{
  m_view->load(url);
}

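void MyPageLoader::replyFinished(bool ok)
{
  // loadFinished(false) means the page failed to load
  if (!ok)
    return;

  // The DOM is only available once loadFinished has fired
  QWebElementCollection elements = m_view->page()->mainFrame()->findAllElements("a");

  foreach (QWebElement e, elements) {
    // Empty string if the tag has no href attribute
    const QString href = e.attribute("href");
    // Process href
  }
}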

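To use the class (loading is asynchronous, so a QApplication event loop must be running and loader must stay alive until the page has loaded):

MyPageLoader loader;
loader.loadPage(QUrl("http://www.example.com"));

replyFinished() is called once the page has loaded; there you can do whatever you like with the collection.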
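
As one illustration of what could be done with the collection: the href values could be copied into plain strings (for instance with e.attribute("href").toStdString()) and then filtered. The following is only a minimal sketch in standard C++11 with no Qt dependency. filterHttpLinks is a hypothetical helper name, not part of Qt, and keeping only absolute http/https links is an assumption about what the caller wants.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical helper (not part of Qt): keeps absolute http/https links,
// drops duplicates, and preserves first-seen order.
std::vector<std::string> filterHttpLinks(const std::vector<std::string>& hrefs)
{
    std::vector<std::string> result;
    std::set<std::string> seen;

    for (const std::string& href : hrefs) {
        // compare() on a too-short string simply reports a mismatch
        const bool isHttp = href.compare(0, 7, "http://") == 0
                         || href.compare(0, 8, "https://") == 0;

        // insert().second is false when the link was already seen
        if (isHttp && seen.insert(href).second) {
            result.push_back(href);
        }
    }
    return result;
}
```

Relative links such as "/search?q=qt" are dropped by this sketch; if they matter, they would have to be resolved against the page URL first.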

cschol
Jaro
  • 1
    I cleaned this up and it didn't work... do I have to wait for the page to load or something? – y2k Feb 01 '10 at 21:10
  • 1
    @JOSHUA: I'd recommend waiting until you get the loadFinished(bool) signal, yes. (http://doc.trolltech.com/4.6/qwebview.html#loadFinished) – Bill Feb 01 '10 at 21:31

This question is already quite old, but I hope this will still help someone.

I wrote two small classes for Qt and published them on SourceForge. They let you access an HTML file in much the same way you are used to with XML.

The project is here:
http://sourceforge.net/projects/sgml-for-qt/
The project wiki contains the documentation.

Drewle

rubenvb
drewle