114

I need to extract content from a website, but the application doesn't provide an application programming interface or any other mechanism to access that data programmatically.

I found a useful third-party tool called Import.io that provides click-and-go functionality for scraping web pages and building data sets. The only thing is that I want to keep my data locally, and I don't want to subscribe to any paid plans.

What kind of technique does this company use to scrape web pages and build its datasets? I found the web scraping frameworks pjscrape & Scrapy. Could they provide such a feature?

0x1ad2
  • 8,014
  • 9
  • 35
  • 48
  • 4
    **PHP is certainly not out of the question, that is plain wrong, obviously. https://gist.github.com/krakjoe/b1526fcc828621e840cb** – Joe Watkins Mar 05 '14 at 12:23
  • @JoeWatkins that looks really cool, does it need a special PHP configuration to run? And how is the performance in comparison with the tools/languages provided below? – 0x1ad2 Mar 06 '14 at 08:48
  • 1
    It requires a thread safe build of PHP, and pthreads, read https://github.com/krakjoe/pthreads/blob/master/README.md, you can find me in chat if you want help, me or anyone else :) – Joe Watkins Mar 06 '14 at 09:45
  • @0x1ad2 If you want to keep data locally then you should try software (http://www.datascraping.co) instead of Web APIs. Most of the tools use XPath, CSS selectors and regex to extract the data from websites, and Data Scraping Studio supports all three of these features. – Vikash Rathee Dec 07 '15 at 10:53
  • There are two ways: one is to roll your own using free/open source libraries, which takes a lot of effort. You can literally generate an ajax web crawler for any site using https://scrape.it It is a paid tool, but it worked when free tools like import.io or kimono could not render. – I Love Python Feb 21 '16 at 22:48
  • I started scraping using Puppeteer a few days ago and find it very useful... I can get anything from the page. https://www.npmjs.com/package/puppeteer – Syed Haseeb Dec 28 '20 at 07:03

2 Answers

277

You will definitely want to start with a good web scraping framework. Later on you may decide that they are too limiting and put together your own stack of libraries, but without a lot of scraping experience your design will be much worse than pjscrape or Scrapy.

Note: I use the terms crawling and scraping basically interchangeably here. This is a copy of my answer to your Quora question; it's pretty long.

Tools

Get very familiar with either Firebug or Chrome dev tools depending on your preferred browser. This will be absolutely necessary as you browse the site you are pulling data from and map out which urls contain the data you are looking for and what data formats make up the responses.

You will need a good working knowledge of HTTP as well as HTML and will probably want to find a decent piece of man-in-the-middle proxy software. You will need to be able to inspect HTTP requests and responses and understand how the cookies, session information and query parameters are being passed around. Fiddler (http://www.telerik.com/fiddler) and Charles Proxy (http://www.charlesproxy.com/) are popular tools. I use mitmproxy (http://mitmproxy.org/) a lot as I'm more of a keyboard guy than a mouse guy.
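
For example, here is a minimal sketch (the URLs, parameters and headers are placeholders) of what that cookie and session passing looks like from the client side once you have mapped it out in your proxy:

import requests

# A Session object stores cookies between calls, mirroring what you see in
# the proxy: the first response sets a cookie, later requests send it back.
session = requests.Session()

landing = session.get("http://example.com/login")        # sets a session cookie
print(landing.status_code, session.cookies.get_dict())

data_page = session.get(
    "http://example.com/items",
    params={"page": 1},                                  # query parameters, visible in the proxy
    headers={"Referer": "http://example.com/login"},     # extra headers if the site expects them
)
print(data_page.headers.get("Content-Type"))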

Some kind of console/shell/REPL type environment where you can try out various pieces of code with instant feedback will be invaluable. Reverse engineering tasks like this are a lot of trial and error so you will want a workflow that makes this easy.
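
For instance, a quick interactive session (illustrative only; the page and the output shown are just an example) might look like this:

>>> import requests
>>> from lxml import html
>>> page = requests.get("http://example.com")
>>> tree = html.fromstring(page.content)
>>> tree.xpath("//h1/text()")            # try a selector, see the result immediately
['Example Domain']
>>> tree.xpath("//p/a/@href")            # tweak it and try again
['https://www.iana.org/domains/example']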

Language

PHP is basically out: it's not well suited for this task, and the library/framework support is poor in this area. Python (Scrapy is a great starting point) and Clojure/Clojurescript (incredibly powerful and productive, but a big learning curve) are great languages for this problem. Since you would rather not learn a new language and you already know Javascript, I would definitely suggest sticking with JS. I have not used pjscrape but it looks quite good from a quick read of their docs. It's well suited and implements an excellent solution to the problem I describe below.
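
For orientation, a minimal Scrapy spider looks roughly like this (a sketch only; the URL, selectors and field names are placeholders):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # CSS selectors against the downloaded page; adjust to the real markup.
        for item in response.css("div.item"):
            yield {
                "title": item.css("h2::text").get(),
                "link": item.css("a::attr(href)").get(),
            }
        # Follow pagination if the site has it.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

You can run a standalone spider like this with something like scrapy runspider example_spider.py -o items.json and get a data file out without any further project setup.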

A note on regular expressions: DO NOT USE REGULAR EXPRESSIONS TO PARSE HTML. A lot of beginners do this because they are already familiar with regexes. It's a huge mistake: use xpath or css selectors to navigate html, and only use regular expressions to extract data from the actual text inside an html node. This might already be obvious to you, and it becomes obvious quickly if you try it, but a lot of people waste a lot of time going down this road for some reason. Don't be scared of xpath or css selectors, they are WAY easier to learn than regexes and they were designed to solve this exact problem.
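
To make the split concrete, here is a small sketch (the HTML snippet and the pattern are made up): a selector locates the node, and the regex only touches the text inside it.

import re
from lxml import html

doc = html.fromstring("<div class='price'>Price: 1,299 USD</div>")

# XPath (or a CSS selector) navigates the document structure...
node_text = doc.xpath("//div[@class='price']/text()")[0]

# ...and a regular expression only extracts data from the node's text.
match = re.search(r"([\d,]+)\s*USD", node_text)
price = int(match.group(1).replace(",", "")) if match else None
print(price)  # 1299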

Javascript-heavy sites

In the old days you just had to make an http request and parse the HTML response. Now you will almost certainly have to deal with sites that are a mix of standard HTML HTTP request/responses and asynchronous HTTP calls made by the javascript portion of the target site. This is where your proxy software and the network tab of firebug/devtools come in very handy. The responses to these might be html or they might be json; in rare cases they will be xml or something else.

There are two approaches to this problem:

The low level approach:

You can figure out what ajax urls the site javascript is calling and what those responses look like and make those same requests yourself. So you might pull the html from http://example.com/foobar and extract one piece of data, and then have to pull the json response from http://example.com/api/baz?foo=b... to get the other piece of data. You'll need to be aware of passing the correct cookies or session parameters. It's very rare, but occasionally some required parameters for an ajax call will be the result of some crazy calculation done in the site's javascript; reverse engineering this can be annoying.
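
A rough sketch of that flow (the API endpoint, parameter names and selectors here are hypothetical; the real ones come from the network tab or your proxy):

import requests
from lxml import html

session = requests.Session()

# 1. Pull the html page and extract one piece of data.
page = session.get("http://example.com/foobar")
tree = html.fromstring(page.content)
titles = tree.xpath("//h1/text()")

# 2. Call the ajax endpoint the page's javascript would call, reusing the
#    same session so the correct cookies get sent along.
api = session.get(
    "http://example.com/api/baz",                    # hypothetical endpoint
    params={"foo": "some_value"},                    # parameters found via the proxy
    headers={"X-Requested-With": "XMLHttpRequest"},  # some sites check for this
)
data = api.json()                                    # the other piece of data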

The embedded browser approach:

Why do you need to work out what data is in html and what data comes in from an ajax call? Managing all that session and cookie data? You don't have to when you browse a site; the browser and the site javascript do that. That's the whole point.

If you just load the page into a headless browser engine like phantomjs it will load the page, run the javascript and tell you when all the ajax calls have completed. You can inject your own javascript if necessary to trigger the appropriate clicks or whatever is necessary to trigger the site javascript to load the appropriate data.

You now have two options: get it to spit out the finished html and parse it, or inject some javascript into the page that does your parsing and data formatting and spits the data out (probably in json format). You can freely mix these two options as well.
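
A sketch of both options using Selenium driving a headless browser (PhantomJS matches the era of this answer; the selectors and the injected script are placeholders):

import json
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.PhantomJS()
driver.get("http://example.com/foobar")

# Wait until the page (and its javascript) has finished loading; real sites
# usually need a site-specific readiness check instead of this generic one.
WebDriverWait(driver, 10).until(
    lambda d: d.execute_script("return document.readyState") == "complete"
)

# Option 1: take the finished html and parse it with your usual tools.
rendered_html = driver.page_source

# Option 2: inject javascript that does the extraction and returns json.
items = json.loads(driver.execute_script("""
    var out = [];
    var nodes = document.querySelectorAll('div.item');
    for (var i = 0; i < nodes.length; i++) { out.push(nodes[i].textContent); }
    return JSON.stringify(out);
"""))

driver.quit()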

Which approach is best?

That depends; you will need to be familiar and comfortable with the low level approach for sure. The embedded browser approach works for anything, will be much easier to implement and will make some of the trickiest problems in scraping disappear. It's also quite a complex piece of machinery that you will need to understand. It's not just HTTP requests and responses: it's requests, embedded browser rendering, site javascript, injected javascript, your own code and 2-way interaction with the embedded browser process.

The embedded browser is also much slower at scale because of the rendering overhead but that will almost certainly not matter unless you are scraping a lot of different domains. Your need to rate limit your requests will make the rendering time completely negligible in the case of a single domain.

Rate Limiting/Bot behaviour

You need to be very aware of this. You need to make requests to your target domains at a reasonable rate. You need to write a well behaved bot when crawling websites, and that means respecting robots.txt and not hammering the server with requests. Mistakes or negligence here are very unethical, since this can be considered a denial of service attack. The acceptable rate varies depending on who you ask; 1 req/s is the maximum the Google crawler runs at, but you are not Google and you probably aren't as welcome as Google. Keep it as slow as reasonable. I would suggest 2-5 seconds between each page request.

Identify your requests with a user agent string that identifies your bot, and have a webpage for your bot explaining its purpose. This url goes in the agent string.
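
Putting those two paragraphs together, a polite fetch helper might look like this sketch (the bot name, info URL and delay are placeholders):

import time
import requests
from urllib import robotparser

USER_AGENT = "ExampleBot/0.1 (+http://example.com/bot-info)"  # links to a page explaining the bot
DELAY_SECONDS = 3                                             # within the suggested 2-5 second range

robots = robotparser.RobotFileParser()
robots.set_url("http://example.com/robots.txt")
robots.read()

session = requests.Session()
session.headers["User-Agent"] = USER_AGENT

def polite_get(url):
    # Skip anything robots.txt disallows for this user agent.
    if not robots.can_fetch(USER_AGENT, url):
        return None
    response = session.get(url)
    time.sleep(DELAY_SECONDS)   # rate limit: one request every few seconds
    return response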

You will be easy to block if the site wants to block you. A smart engineer on their end can easily identify bots and a few minutes of work on their end can cause weeks of work changing your scraping code on your end or just make it impossible. If the relationship is antagonistic then a smart engineer at the target site can completely stymie a genius engineer writing a crawler. Scraping code is inherently fragile and this is easily exploited. Something that would provoke this response is almost certainly unethical anyway, so write a well behaved bot and don't worry about this.

Testing

Not a unit/integration test person? Too bad. You will now have to become one. Sites change frequently and you will be changing your code frequently. This is a large part of the challenge.

There are a lot of moving parts involved in scraping a modern website, good test practices will help a lot. Many of the bugs you will encounter while writing this type of code will be the type that just return corrupted data silently. Without good tests to check for regressions you will find out that you've been saving useless corrupted data to your database for a while without noticing. This project will make you very familiar with data validation (find some good libraries to use) and testing. There are not many other problems that combine requiring comprehensive tests and being very difficult to test.

The second part of your tests involves caching and change detection. While writing your code you don't want to be hammering the server for the same page over and over again for no reason. While running your unit tests you want to know if your tests are failing because you broke your code or because the website has been redesigned. Run your unit tests against a cached copy of the urls involved. A caching proxy is very useful here but tricky to configure and use properly.

You also do want to know if the site has changed. If they redesigned the site and your crawler is broken your unit tests will still pass because they are running against a cached copy! You will need either another, smaller set of integration tests that are run infrequently against the live site or good logging and error detection in your crawling code that logs the exact issues, alerts you to the problem and stops crawling. Now you can update your cache, run your unit tests and see what you need to change.
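
As a sketch, a regression test against a cached copy might look like this (the fixture file and the parse_listing function are hypothetical stand-ins for your own code):

import pathlib

from myscraper import parse_listing   # your own extraction code

FIXTURES = pathlib.Path(__file__).parent / "fixtures"

def test_listing_page_extracts_all_items():
    # The fixture is a cached copy of the live page, saved when it was known good.
    html = (FIXTURES / "listing_page.html").read_text()
    items = parse_listing(html)
    assert len(items) == 25                              # known count for this cached copy
    assert all(item["price"] > 0 for item in items)      # basic data validation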

Legal Issues

The law here can be slightly dangerous if you do stupid things. If the law gets involved you are dealing with people who regularly refer to wget and curl as "hacking tools". You don't want this.

The ethical reality of the situation is that there is no difference between using browser software to request a url and look at some data and using your own software to request a url and look at some data. Google is the largest scraping company in the world and they are loved for it. Identifying your bots name in the user agent and being open about the goals and intentions of your web crawler will help here as the law understands what Google is. If you are doing anything shady, like creating fake user accounts or accessing areas of the site that you shouldn't (either "blocked" by robots.txt or because of some kind of authorization exploit) then be aware that you are doing something unethical and the law's ignorance of technology will be extraordinarily dangerous here. It's a ridiculous situation but it's a real one.

It's literally possible to try and build a new search engine on the up and up as an upstanding citizen, make a mistake or have a bug in your software and be seen as a hacker. Not something you want considering the current political reality.

Who am I to write this giant wall of text anyway?

I've written a lot of web crawling related code in my life. I've been doing web related software development for more than a decade as a consultant, employee and startup founder. The early days were writing perl crawlers/scrapers and php websites, when we were embedding hidden iframes loading csv data into webpages to do ajax before Jesse James Garrett named it ajax, before XMLHttpRequest was an idea. Before jQuery, before json. I'm in my mid-30's, which is apparently considered ancient for this business.

I've written large scale crawling/scraping systems twice, once for a large team at a media company (in Perl) and recently for a small team as the CTO of a search engine startup (in Python/Javascript). I currently work as a consultant, mostly coding in Clojure/Clojurescript (a wonderful expert language in general, with libraries that make crawler/scraper problems a delight).

I've written successful anti-crawling software systems as well. It's remarkably easy to write nigh-unscrapable sites if you want to or to identify and sabotage bots you don't like.

I like writing crawlers, scrapers and parsers more than any other type of software. It's challenging, fun and can be used to create amazing things.

Jesse Sherlock
  • 3,080
  • 1
  • 18
  • 10
  • 4
    I used to agree with you about PHP being a bad choice, but with the right libraries it's not too bad. Regex and array/string manipulation is clumsy but on the plus side it's fast and everywhere. – pguardiario Mar 05 '14 at 01:01
  • 3
    In an environment where there are a few libraries that make this a pleasure and a lot that make it quite simple and quite easy ... why would you settle for "not too bad". I agree, it's doable in PHP (and FORTRAN, C, VB, etc.) but unless your problem is really really simple then it would be a much better idea to use the right tools for the job. And again, unless you have an incredibly simple problem to solve ... what does it matter that regex is everywhere? Installing libraries is much simpler than almost every scraping problem. And actually, regex is often quite slow for this problem. – Jesse Sherlock Apr 17 '14 at 00:24
  • I think I can do it as easily in PHP as you can in whatever you're using. – pguardiario Apr 17 '14 at 08:24
  • 5
    You might be right, but I know for a fact that *I* can't do it as easily in PHP. Before I moved away from PHP I had close to a decade of professional PHP experience. I spent more than a year full time building a scraping system at scale, in Python, and I can't imagine doing without some of the nice libraries that aren't available in PHP or doing without the concise meta-programming techniques available in Python. That's also the reason I moved to Clojure, to get even more powerful meta-programming abilities. – Jesse Sherlock Apr 18 '14 at 22:55
  • Thanks for the great post, @JesseSherlock. Could you point to some useful clojure libraries? I know about enlive for the scraping, but what's your tool of choice for the actual spider? Anything else you found useful? – mat_dw Jul 04 '14 at 06:01
  • 4
    Enlive, along with the power of Clojure itself for project specific code, are the biggest winners. Schema is a great validation library, which is such a big part of information extraction code. I'm currently really happy with the easy interop with Java world for things like Mahout as well as Nashorn/Rhino for some kinds of js execution. And Clojure people are the types who write libs like this https://github.com/shriphani/subotai so that you don't have to. ... continued in next comment ... – Jesse Sherlock Jul 04 '14 at 19:59
  • 3
    I've also found that when you really need a real browser and need to go with phantomjs/casperjs it's really great to use clojurescript (often code shared between clj and cljs using cljx) to write the js you inject into the page instead of clojurescript. Core.async is great for coordinating highly concurrent crawling code on the server as well as getting out of callback hell inside the js environment (coordinating browser automation with core.async cljs code inside phantomjs is heaven compared to the alternatives). – Jesse Sherlock Jul 04 '14 at 19:59
  • 2
    For actual crawling I'm still trying to find my favorite toolset, for from scratch custom crawlers http-kit + core.async is pretty wonderful. When I need a more standard crawler using Nutch to crawl and Clojure libs to process has been pretty successful. – Jesse Sherlock Jul 04 '14 at 20:00
  • I don't suggest PHP if you are going to run the crawler, say, 24 hours. Try Node.JS, Python or other alternative out there that can handle that long process. – P.M Nov 25 '14 at 02:04
  • I've written a couple crawlers recently completely in php and I can see how dealing with asynchronous requests would be rather challenging. But the code worked perfectly fine and in some cases I think it would be a perfectly acceptable tool. That said, thanks for the great post and info! – But those new buttons though.. Dec 07 '14 at 16:37
  • 3
    @mat_dw this is a late addition to the Clojure libs list, but since this is fairly popular I will add that if you have to ingest any network stuff including HTML I would suggest using Aleph over Http-Kit. It's got support for back-pressure (if anyone is sending you data) and the new (0.4.0 as of the time of this post) version uses a new enough version of Netty that you can easily add proxies as well as the other Netty pipeline transforms. – Jesse Sherlock Apr 23 '15 at 00:24
  • I scrape with PHP all the time. Packages like Guzzle + DomCrawler, or Goutte, etc... make it pretty easy. – ryanwinchester Jul 03 '15 at 00:01
  • @JesseSherlock, is there a place where I can look at some of your clojure/clojurescript scraping code ? Thanks – user3639782 Feb 07 '17 at 08:19
  • 2
    @user3639782, I don't have any good code published right now, all the work so far has been on contract and not owned by me. I am literally working on a scraping library project right now that will be open source and have both Clojure and Clojurescript code but it is very early days. I've made a note to myself to ping you when I do put up the first set of commits to github, hopefully in a month or two. – Jesse Sherlock Feb 10 '17 at 09:12
  • @JesseSherlock, Fine! Thanks to ping me as I haven't found out your github page. – user3639782 Feb 10 '17 at 12:35
  • 1
    PHP is not out. – 6opko May 10 '17 at 17:24
  • Now you should make a post about how to create an almost if not completely unscrapable website! – oldboy Jun 04 '18 at 19:15
23

Yes, you can do it yourself. It is just a matter of grabbing the source of the page and parsing it the way you want.

There are various possibilities. A good combo is python-requests (a higher-level HTTP library than urllib2, which is urllib.request in Python 3) and BeautifulSoup4, which has its own methods to select elements and also supports CSS selectors:

import requests
from bs4 import BeautifulSoup as bs

# Fetch the page, then parse the HTML with an explicit parser.
response = requests.get("http://foo.bar")
soup = bs(response.text, "html.parser")
# Find all divs carrying the example CSS class.
some_elements = soup.find_all("div", class_="myCssClass")

Some will prefer XPath parsing, the jQuery-like pyquery, lxml, or something else.
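
For example, with lxml you can run XPath directly over the same response (a sketch; the selectors are placeholders, and CSS selector support additionally needs the cssselect package):

from lxml import html

tree = html.fromstring(response.text)
divs = tree.xpath("//div[@class='myCssClass']")           # same elements as above, via XPath
links = tree.xpath("//div[@class='myCssClass']//a/@href")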

When the data you want is produced by some JavaScript, the above won't work. You'll need either python-ghost or Selenium. I prefer the latter combined with PhantomJS: it's much lighter, simpler to install, and easy to use:

from bs4 import BeautifulSoup as bs
from selenium import webdriver

# PhantomJS runs the page's javascript before we grab the rendered source.
client = webdriver.PhantomJS()
client.get("http://foo")
soup = bs(client.page_source, "html.parser")

I would advise starting with your own solution. You'll understand Scrapy's benefits by doing so.

ps: take a look at scrapely: https://github.com/scrapy/scrapely

pps: take a look at Portia, to start extracting information visually, without programming knowledge: https://github.com/scrapinghub/portia

Ehvince
  • 17,274
  • 7
  • 58
  • 79
  • Alright, thanks for the answer. The only problem is that Python isn't in my skill-set. Are there other good programming languages that could do the same tasks? I mainly work with PHP and Javascript. – 0x1ad2 Mar 04 '14 at 13:29
  • Sorry for the confusion(I mentioned the Python framework in my question), but if Python is the best way to do it I could learn it. – 0x1ad2 Mar 04 '14 at 13:43
  • 1
    Python makes scrapy very easy. It is also easy to learn. The best scraper that performs well at the moment is scrapy. They also have a very good documentation. – Abhishek Mar 04 '14 at 16:02