I am planning to write a webcrawler for a NLP project, that reads in the thread structure of a forum everytime in a specific interval and parses each thread with new content. Via regular expressions, the author, the date and the content of new posts is extracted. The result is then stored in a database.
The language and plattform used for the crawler have to match the following criteria:
- easily scalable on multiple cores and cpus
- suited for high I/O loads
- fast regular expression matching
- easily to maintain/few operational overhead
After some research I think Erlang might be a fitting candidate, but I read it's not very good at string processing (and so regular expression matching). Neither do I have any expirience about the maintenance factor.
Is Erlang a good technology for the scenario described above? And if not, what would be a good alternative?