13

I am planning to write a web crawler for an NLP project that reads the thread structure of a forum at a fixed interval and parses each thread that has new content. The author, date and content of each new post are extracted with regular expressions, and the result is then stored in a database.
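
To make the extraction step concrete, here is a minimal sketch in Python, used only for illustration. The markup, the class names, the `extract_posts` helper and the `since` parameter are all hypothetical; the real forum's HTML will look different, and a real crawler would fetch pages and write to a database rather than work on an inline string.

```python
import re

# Hypothetical forum markup; the real forum's HTML will differ.
SAMPLE_THREAD = """
<div class="post"><span class="author">alice</span><span class="date">2012-02-05 19:21</span><div class="content">First post.</div></div>
<div class="post"><span class="author">bob</span><span class="date">2012-02-06 06:03</span><div class="content">A reply
spanning two lines.</div></div>
"""

# One pattern per post; re.DOTALL lets the content span line breaks.
POST_RE = re.compile(
    r'<div class="post">'
    r'<span class="author">(?P<author>[^<]+)</span>'
    r'<span class="date">(?P<date>[^<]+)</span>'
    r'<div class="content">(?P<content>.*?)</div></div>',
    re.DOTALL,
)


def extract_posts(html, since=""):
    """Return author/date/content dicts for posts dated after `since`."""
    posts = [m.groupdict() for m in POST_RE.finditer(html)]
    # Dates in "YYYY-MM-DD HH:MM" form compare correctly as plain strings.
    return [p for p in posts if p["date"] > since]


# Only posts newer than the last crawl (an assumed timestamp) are kept.
new_posts = extract_posts(SAMPLE_THREAD, since="2012-02-06 00:00")
```

The pattern is compiled once and reused for every thread, which is where most of the regex cost would be amortised in a long-running crawler.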

The language and platform used for the crawler have to meet the following criteria:

  • easily scalable across multiple cores and CPUs
  • suited for high I/O loads
  • fast regular expression matching
  • easy to maintain / low operational overhead

After some research I think Erlang might be a fitting candidate, but I have read that it's not very good at string processing (and therefore at regular expression matching). I also have no experience with how much maintenance it requires.

Is Erlang a good technology for the scenario described above? And if not, what would be a good alternative?

Thomas
  • 1
    This is probably better asked on http://programmers.stackexchange.com ; it falls under "not constructive" here IMHO – Brian Roach Feb 05 '12 at 19:21
  • Your criteria have at least as much to do with the overall design and architecture as the language. You can build scalable webcrawlers in Erlang, Python, Java, whatever. It also depends on your current programming language experience and your timescales. – DNA Feb 05 '12 at 20:01
  • I would really like to use Erlang for this project because it might be the best fit from what I have read so far. My question is whether the poor regex matching makes it a no-go for this project, and how high the operating expense (esp. for maintenance) would be in practice. – Thomas Feb 05 '12 at 20:12
  • Erlang is (to me) the best choice for a crawler. It's possible, and easier, to spawn and distribute the activities of parsing and crawling across as many VM instances or physical machines as possible. – Muzaaya Joshua Feb 06 '12 at 06:03

3 Answers

9

I am also evaluating Erlang for use as a web crawler and it looks good so far.

There are lots of existing helpful modules: HTML parser, HTTP client, XPath, regex, cache.

And other people are interested in the same use case, so you can learn from them.

However, if this is just a one-off project, I recommend Python / Ruby / Perl because they will be easier to get started with.

hoju
  • Thank you for the answer and the links provided. The project is meant to last longer, so the additional time cost of learning Erlang is negligible (in fact, it would be fun to learn a new language :) ) – Thomas Feb 06 '12 at 16:38
  • 1
    yeah it's fun to learn something so different. I have been working through this tutorial: http://learnyousomeerlang.com/ – hoju Feb 06 '12 at 21:03
4

If you're familiar and comfortable with Erlang, then I'd stick with it, although I'm not familiar with Erlang myself. With that noted, I'll give you some pointers:

  1. Don't use regular expressions to parse HTML; use XPath instead.
    HTML, while structured, is still quite difficult to parse in the wild, and regular expressions are fairly slow and unreliable for parsing it.
  2. Decide what your crawler architecture is going to be and what your re-visit policy is.
  3. Find the best selection policy for you and implement it.
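
As a rough illustration of point 1, here is a minimal sketch in Python using the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed input. The sample markup, the class names and the `extract_posts_xpath` helper are hypothetical; real forum pages are rarely this clean and would likely need a more tolerant HTML parser, and real XHTML usually carries a namespace that the paths would have to account for.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML fragment without a namespace;
# real forum pages will differ.
SAMPLE_XHTML = """<html><body>
<div class="post"><span class="author">alice</span><span class="date">2012-02-05</span><p class="content">First post.</p></div>
<div class="post"><span class="author">bob</span><span class="date">2012-02-06</span><p class="content">A reply.</p></div>
</body></html>"""


def extract_posts_xpath(xhtml):
    """Return author/date/content dicts for every post in the page."""
    root = ET.fromstring(xhtml)
    posts = []
    # Select every <div class="post"> anywhere below the root element.
    for post in root.findall(".//div[@class='post']"):
        posts.append({
            # Paths are relative to the current post element.
            "author": post.findtext("span[@class='author']"),
            "date": post.findtext("span[@class='date']"),
            "content": post.findtext("p[@class='content']"),
        })
    return posts


posts = extract_posts_xpath(SAMPLE_XHTML)
```

Compared with a single large regular expression, each field has its own short path, so a change in the forum's layout tends to break one lookup rather than the whole pattern.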

A web crawler is a fairly complex system to build, and you have to be concerned about speed, performance, scalability and concurrency. Some of the most notable crawlers are written in C++ and Java, but I have not heard of any crawlers written in Erlang.

Kiril
  • Thank you for the suggestion. My parser has to be written specifically for the forum I am going to parse, and I need only very small parts of the whole XHTML tree, so regexes could be much cheaper than XPath. I think the concerns you raise make Erlang a perfect fit, depending on the possible bottleneck of regex matching. – Thomas Feb 05 '12 at 23:24
  • How much crawling do you expect to do? How many pages per hour? If you're not doing too many, then regex is not going to kill you, but it will still be less reliable and slower than XPath, imho. Regex is also a pain to maintain, debug and understand, so I try to avoid it at all costs. That's just my preference, though. If you tell me some more about the performance requirements, then I can also recommend some other things, like some papers on crawler architecture. – Kiril Feb 06 '12 at 07:24
  • You are correct that regexes are less reliable; however, they are generally faster than XPath. – hoju Feb 06 '12 at 12:02
3

Erlang is fine for this. Its regex library delegates (nearly all) work to PCRE, which should be fast enough. But avoid strings and use binaries instead! Binaries use a lot less memory and are also faster to translate to C strings.

Alexey Romanov