I would like to build a small bot that automatically and periodically visits a few partner websites. This would save many of our employees several hours.

The bot must be able to:

  • Connect to these websites, log in as a user on some of them, and access and parse a particular piece of information.
  • Integrate with our website and change its settings (which user to log in as…) based on data from our website. Eventually it must summarize the parsed information.
  • Preferably run on the client side, not on the server.

I tried Dart last month and loved it… I would like to do it in Dart.

But I am a bit lost: can I use a Document class object for each website I want to parse? Can it be headless, or should I use the Chrome/Dartium API to control the web browser (I'd like to avoid this)?

I've been reading this thread: https://groups.google.com/a/dartlang.org/forum/?fromgroups=#!searchin/misc/crawler/misc/TkUYKZXjoEg/Lj5uoH3vPgIJ Is using https://github.com/dart-lang/html5lib a good idea for my case?

Lee Taylor
Jhon _

1 Answer

There are two parts to this.

  1. Get the page from the remote site.
  2. Read the page into a class that you can parse.

For the first part, if you are planning on running this client-side, you are likely to run into cross-site issues, in that your page, served from server X, cannot request pages from server Y, unless the correct headers are set.

See "CORS with Dart, how do I get it to work?" and "Dart application and cross domain policy". In short, the site in question needs to return the correct CORS headers.
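
To make the rule concrete, here is a minimal sketch in plain Dart of the decision the browser makes for a simple cross-origin GET. The function name `canReadResponse` and its parameters are hypothetical and only model the check; the real enforcement happens inside the browser, and this sketch ignores preflight requests and credentialed requests (where `*` is not accepted).

```dart
/// Simplified model of the browser's check for a simple cross-origin GET.
///
/// [pageOrigin] is the origin of the page running the script, for example
/// 'http://foo.com'. [allowOriginHeader] is the value of the
/// Access-Control-Allow-Origin response header, or null if it is absent.
/// Hypothetical helper for illustration; not part of any Dart library.
bool canReadResponse(
    String pageOrigin, String requestUrl, String? allowOriginHeader) {
  // Uri.origin yields scheme://host[:port] for http and https URLs.
  final targetOrigin = Uri.parse(requestUrl).origin;

  // Same-origin requests are always readable.
  if (targetOrigin == pageOrigin) return true;

  // Cross-origin: the remote server must opt in with a header.
  if (allowOriginHeader == null) return false;
  return allowOriginHeader == '*' || allowOriginHeader == pageOrigin;
}
```

Under this model, a page on http://foo.com can read http://bar.com/page only if bar.com answers with `Access-Control-Allow-Origin: *` or `Access-Control-Allow-Origin: http://foo.com`.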

Assuming that you can actually get the pages from the remote site client-side, you can use HttpRequest to retrieve the actual content:

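import 'dart:html';

// snippet of code...
// HttpRequest.getString returns a Future that completes with the response body.
HttpRequest.getString("http://www.example.com").then((responseText) {
  // process the responseText
});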

You can also use HttpRequest.getWithCredentials (exposed as the withCredentials option in newer versions of dart:html). If the site has some custom login, then you will probably have problems, as you will likely have to HTTP POST the username and password from your site to their server.

This is when the second part comes in. You can process your HTML using the DocumentFragment.html(...) constructor, which gives you a nodes collection that you can iterate and recurse through. The example below shows this for a static block of html, but you could use the data returned from the HttpRequest above.

import 'dart:html';

void main() {
  var d = new DocumentFragment.html("""
    <html>
      <head></head>
      <body>Foo</body>
    </html>
  """);

  // print the content of the top level nodes
  d.nodes.forEach((node) => print(node.text)); // prints "Foo"
  // real-world - use recursion to go down the hierarchy.

}
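
The recursion mentioned in the comment above could look roughly like the sketch below. It runs only in a browser, since it depends on dart:html. The `walk` function, the sample markup, and the `.price` class are assumptions made up for illustration. Note that DocumentFragment.html sanitizes its input by default, so elements or attributes it considers unsafe may be dropped. The same fragment also supports CSS selectors through querySelectorAll, which is often simpler than walking the tree by hand.

```dart
import 'dart:html';

/// Visits [node] and all of its descendants depth-first.
/// Hypothetical helper for illustration.
void walk(Node node, void Function(Node node, int depth) visit,
    [int depth = 0]) {
  visit(node, depth);
  for (final child in node.nodes) {
    walk(child, visit, depth + 1);
  }
}

void main() {
  // Static markup stands in for the text returned by an HttpRequest.
  final fragment = DocumentFragment.html('''
    <div id="prices">
      <span class="price">12</span>
      <span class="price">30</span>
    </div>
    <a href="/next">next</a>
  ''');

  // Manual recursion: print each element's tag name, indented by depth.
  walk(fragment, (node, depth) {
    if (node is Element) {
      final indent = '  ' * depth;
      print('$indent${node.tagName.toLowerCase()}');
    }
  });

  // Selector-based access to specific tags.
  for (final span in fragment.querySelectorAll('.price')) {
    print(span.text);
  }
  for (final link in fragment.querySelectorAll('a')) {
    print(link.attributes['href']);
  }
}
```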

I'm guessing (not having written a spider before) that you'd want to pull out specific tags at specific locations / depths to summarize as your results, and also add the URLs found in <a> hyperlinks to a queue that your bot will navigate into.
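
The queue of URLs could be sketched as follows in plain Dart. The `CrawlFrontier` class and its method names are hypothetical. It assumes the href strings have already been extracted from the page, resolves them against the page's URL, and skips duplicates and non-HTTP links.

```dart
import 'dart:collection';

/// Hypothetical helper: pages that still have to be visited.
class CrawlFrontier {
  final Queue<Uri> _pending = Queue<Uri>();
  final Set<Uri> _seen = <Uri>{};

  /// Resolves [href] against [base] and queues the result.
  /// Returns false if the link is malformed, is not http/https,
  /// or has already been queued.
  bool add(Uri base, String href) {
    final parsed = Uri.tryParse(href);
    if (parsed == null) return false;

    // Drop the #fragment so the same page is not queued twice.
    final resolved = base.resolveUri(parsed).removeFragment();
    if (!resolved.isScheme('http') && !resolved.isScheme('https')) {
      return false;
    }
    if (!_seen.add(resolved)) return false;

    _pending.add(resolved);
    return true;
  }

  bool get isEmpty => _pending.isEmpty;

  /// Removes and returns the oldest queued URL (breadth-first order).
  Uri next() => _pending.removeFirst();
}
```

A bot would call `add` for every href found on a page, then loop with `next()` until `isEmpty` is true, fetching and parsing each page in turn.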

Chris Buckett
  • Thank you for this long and detailed answer. – Jhon _ Jan 04 '13 at 06:55
  • Yes, I will very probably need to make an HTTP POST. So this is a problem that cannot be easily solved (due to CORS)? You talked about iterating over nodes, but would it be possible to use a Dart query, as available on this DocumentFragment: http://api.dartlang.org/docs/bleeding_edge/dart_html/DocumentFragment.html – Jhon _ Jan 04 '13 at 07:08
  • I am asking these questions, but I am not sure I completely understand the limits of CORS. This link from your second link seems really clear: https://developer.mozilla.org/en-US/docs/HTTP/Access_control_CORS I am still reading. – Jhon _ Jan 04 '13 at 07:09
  • Basically, if your site served from foo.com requests a page or other data from bar.com, then bar.com must either provide a header explicitly allowing foo.com to request data, or provide a header allowing any site to request data (e.g. Access-Control-Allow-Origin: *). Another workaround might be [JSONP](http://blog.sethladd.com/2012/03/jsonp-with-dart.html) - although you're not requesting JSON, the same method will probably still work. Note - these are not restrictions in Dart, but in the browser security model. – Chris Buckett Jan 04 '13 at 08:07