
I just have a few questions about the topic.

Can someone explain the advantages and disadvantages of using the following languages to write a scraper:

Java/Groovy

Perl

PHP

Selenium

Python

I'm also wondering what kind of issues to expect while scraping and how I should deal with them. For instance, I have come across fragment identifiers and haven't found a way to handle them yet. (I'm using HtmlUnit)

Just looking for some pointers for those who know a bit about the topic.

Brian Tompsett - 汤莱恩
StartingGroovy

3 Answers


I recommend starting with Python + lxml. Mechanize is sometimes helpful too.

Websites that depend on JavaScript or cookies are harder to scrape, but most are straightforward.

Make sure to leave a few seconds between your requests to avoid being blocked.
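A minimal sketch of the lxml approach recommended here — the HTML snippet and XPath are made up for illustration, and the polite delay is shown as a comment rather than an actual network loop:

```python
from lxml import html  # third-party: pip install lxml

# A stand-in page; in practice you would fetch this with
# urllib.request or Mechanize, sleeping a few seconds between requests.
PAGE = """
<html><body>
  <div id="results">
    <a href="/item/1">First</a>
    <a href="/item/2">Second</a>
  </div>
</body></html>
"""

tree = html.fromstring(PAGE)
# XPath works even on fairly malformed HTML, which is lxml's strength.
links = tree.xpath('//div[@id="results"]/a/@href')
print(links)  # ['/item/1', '/item/2']
```

From here a real scraper would loop over `links`, fetching each URL with a `time.sleep()` of a few seconds in between to avoid being blocked.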

hoju
  • Thank you, I hadn't considered looking into Python solely for scraping purposes. I will have to take a look at your suggestions. Also thanks for the tip about spacing out the requests on a timer. – StartingGroovy Dec 17 '10 at 21:54

Consider looking at TestPlan. It has its own high-level language but you can also write modules in Java. It supports the Selenium back-end as well as HTMLUnit.

If you can give a specific problem (question) with your fragments, then I can answer that as well.

edA-qa mort-ora-y
  • I think I am going to take you up on your advice (seeing as I'm most familiar with Java/Groovy). I was also thinking about looking into Selenium; I have heard quite a few things about it. As for my specific problem: http://stackoverflow.com/questions/4320179/htmlunit-getbyxpath-returns-null I pinpointed the issue in a comment on the answer. I haven't resolved that specific issue yet; I'm unaware of how to deal with the fragment identifier in HtmlUnit. – StartingGroovy Dec 17 '10 at 21:55
  • Just wondering if you had a moment to look into my question? – StartingGroovy Dec 20 '10 at 21:54

The advantages/disadvantages are related more to the frameworks available than to the programming language per se.

  1. If you need to scrape JavaScript/Ajax websites, HtmlUnit is one of the best options, but to use it directly you need a language running on the JVM (Java, Jython, Clojure, etc.). Another alternative (for JavaScript/Ajax) is writing a Google Chrome extension (easier than Firefox) or embedding a web browser within your application. A third alternative is using an automation tool like the ones at http://openqa.org/ (e.g. Selenium, Watir).
  2. If you don't need JavaScript/Ajax support, in my experience lxml is the best scraping library under CPython, especially when working with malformed HTML. Other HTML parsers don't work well in every circumstance.
  3. Beyond (1) and (2), another important question is whether you have a parallel crawling framework (if you need speed). (1), (2), and (3) together are hard to find.
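A rough sketch of point (3) using only the Python standard library — `fetch` here is a hypothetical stand-in for a real HTTP request, and a production crawler would also rate-limit per host and handle errors and retries:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Hypothetical placeholder for urllib.request.urlopen(url).read();
    # returns fake page contents so the sketch runs without a network.
    return f"<html>contents of {url}</html>"

urls = [f"https://example.com/page/{i}" for i in range(5)]

# Fetch several pages concurrently with a small worker pool.
with ThreadPoolExecutor(max_workers=3) as pool:
    pages = list(pool.map(fetch, urls))

print(len(pages))  # 5
```

Dedicated frameworks add politeness delays, URL frontiers, and deduplication on top of this basic fan-out, which is why (1), (2), and (3) rarely come in one package.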
sw.
  • Thank you for the detailed answer. You have cleared up quite a few questions I had floating around. Have you dealt with Selenium? I have been considering checking it out, but haven't made the step yet. I figured I would do a bit of research before jumping aboard. – StartingGroovy Dec 17 '10 at 21:58
  • No, I haven't used Selenium; I used Watir. In this context, look at a past question about the pros/cons: http://stackoverflow.com/questions/606550/watir-vs-selenium-vs-sahi I'll just add that Watir seemed slow to me. – sw. Dec 22 '10 at 19:08
  • Well thank you for the input, it's been quite helpful – StartingGroovy Dec 22 '10 at 20:42