
I'm not new to programming, but I am (very) new to web scraping. I'd like to get data from this website in this manner:

  1. Get the team-data from the given URL and store it in some text file.
  2. "Click" the links of each of the team members and store that data in some other text file.
  3. Click various other specific links and store data in its own separate text file.

Again, I'm quite new to this. I have tried opening the specified website with urllib2 (hoping to then parse it with BeautifulSoup), but the request timed out.
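A common cause of this kind of timeout is that the server stalls or rejects requests carrying Python's default user agent. A minimal sketch of a more robust fetch, shown here with Python 3's `urllib.request` (the successor to `urllib2`) and a hypothetical team URL, sends a browser-like `User-Agent` header and an explicit timeout:

```python
from urllib.request import Request, urlopen  # "urllib2" in Python 2
from urllib.error import URLError

# Hypothetical URL -- substitute the actual team page.
url = "http://example.com/team"

# A browser-like User-Agent; many sites stall or refuse the default
# "Python-urllib/x.y" agent, which can surface as a timeout.
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})

def fetch(request, timeout=10):
    """Return the page body as text, or None on a network error."""
    try:
        with urlopen(request, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except URLError as err:
        print("fetch failed:", err)
        return None
```

The explicit `timeout` also makes failures surface quickly instead of hanging, so a script crawling many pages can log the failure and move on.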

Ultimately, I'd like to do something like specify a team's URL to a script, and have said script update associated text files of the team, its players, and various other things in different links.

Considering what I want to do, would it be better to learn how to create a web crawler, or to do things directly via urllib2? I'm under the impression that a spider is faster but will basically follow links at random unless told to do otherwise (I don't know whether this impression is accurate).

Has QUIT--Anony-Mousse
AmagicalFishy
  • I recommend starting out with the native/raw approach, i.e. with urllib(2) or requests (Google for "Python requests library") and BeautifulSoup. You'll have fun learning the concepts, and it is not a huge amount of work. In particular, by using this approach you learn a lot about the data that you want to obtain. Sooner or later you will hit some limitations (what do you mean by "large amounts", by the way? It's quite a relative term), and you will need to manage them. In any case, the experience you gain by tackling this with urllib/BS will help, so just go for it! – Dr. Jan-Philip Gehrcke Feb 12 '15 at 21:35
  • here's [code example of what @Jan-Philip Gehrcke said](http://stackoverflow.com/a/14338006/4279) – jfs Feb 12 '15 at 21:39
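The workflow the comments suggest (fetch the team page, extract the member links, fetch each link, and write each page to its own text file) can be sketched with only the standard library. The roster HTML below is a hypothetical example; with BeautifulSoup, the link extraction would instead be `[a["href"] for a in soup.find_all("a", href=True)]`:

```python
from html.parser import HTMLParser

# Stdlib stand-in for BeautifulSoup's link extraction, so the sketch is
# self-contained and runnable without third-party installs.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Collect the href of every anchor tag encountered.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

# Hypothetical fragment of a team-roster page.
page = ('<ul><li><a href="/player/1">Alice</a></li>'
        '<li><a href="/player/2">Bob</a></li></ul>')

member_links = extract_links(page)

# Each member link would then be fetched (with urljoin to resolve the
# relative path against the team URL) and written to its own text file:
# for link in member_links:
#     body = fetch(urljoin(team_url, link))
#     with open(link.rsplit("/", 1)[-1] + ".txt", "w") as f:
#         f.write(body)
```

In practice you would filter the extracted hrefs (e.g. keep only those starting with `/player/`) so the script follows just the links you care about, which is exactly the control a random-walking spider wouldn't give you out of the box.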

0 Answers