1

I'm not talking about extracting text or downloading a single web page, but I see people downloading whole web sites. For example, there is a directory called "example" that isn't even linked anywhere on the site; how do I know it's there? How do I download "ALL" pages of a website? And how do I protect against this?

For example, Apache has "directory listing": how do I get a list of the directories under the root if there is already an index file?

This question is not language-specific. I would be happy with just a link that explains the techniques used for this, or a detailed answer.

a23ziz
  • You may want to use: http://www.httrack.com/ – Amal Murali Sep 28 '13 at 14:32
  • First off, you might want to [disable directory listing](http://stackoverflow.com/questions/2530372/how-do-i-disable-directory-browsing) -- but that doesn't stop httrack (which just follows links from a certain page). You can also set up robots.txt, but any evil scraper will ignore that. – Dave Chen Sep 28 '13 at 14:38
  • About directory listing: how do I get a list of directories under the root if there is an index file already? – a23ziz Sep 28 '13 at 14:40
  • Apache will serve a page instead of the listing. If a file isn't referenced anywhere (robots, direct links), then it won't be known to a scraper. – Dave Chen Sep 28 '13 at 14:43

2 Answers

1

OK, so to answer your questions one by one. How do you know that a 'hidden' (unlinked) directory is on the site? You don't, but you can probe the most common directory names and see whether they return HTTP 200 or 404 (a short probing sketch follows below). With a couple of threads you can check thousands of names a minute. That said, always weigh the number of requests you are making against the specific website and the amount of traffic it handles, because for small to mid-sized websites this could cause connectivity issues or even a short DoS, which is of course undesirable. You can also use search engines to find unlinked content: it may have been discovered by the search engine by accident, or there may be a link to it from another site (for instance, google site:targetsite.com will list all the indexed pages).

How to download all pages of a website has already been answered: essentially you start at the base URL, parse the HTML for links, images and other references that point to on-site content, and follow them. You can further deconstruct links into their parent directories and check those for index pages, and brute-force common directory and file names.
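
For illustration, here is a minimal sketch of such a probe, assuming the third-party requests package; the target URL and the wordlist of common directory names are placeholders. In practice you would throttle it (and possibly thread it, as mentioned above) with the target site's capacity in mind:

```python
# Sketch: probe a site for common, unlinked directory names.
# Assumes the third-party "requests" package; target and wordlist are placeholders.
import requests

BASE = "http://targetsite.com"               # placeholder target
COMMON_DIRS = ["admin", "backup", "old", "test", "uploads", "example"]

def probe(base_url, names, timeout=5):
    found = []
    for name in names:
        url = f"{base_url}/{name}/"
        try:
            resp = requests.get(url, timeout=timeout, allow_redirects=False)
        except requests.RequestException:
            continue                         # network error: skip this name
        if resp.status_code == 200:          # directory (or its index) responded
            found.append(url)
    return found

if __name__ == "__main__":
    for hit in probe(BASE, COMMON_DIRS):
        print("exists:", hit)
```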

You can't really protect against bots effectively without hurting the user experience. For instance, you could limit the number of requests per minute, but if you have an AJAX-heavy site a normal user will also produce a large number of requests, so that really isn't the way to go. You can check the user agent and whitelist only 'regular' browsers, but most scraping scripts identify themselves as regular browsers, so that won't help you much either. Lastly, you can blacklist IPs, but that is not very effective: there are plenty of proxies, onion routing and other ways to change your IP.
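
As a rough illustration of the per-minute limiting idea, here is a minimal sketch using Flask with an in-memory counter; the framework choice and the limit of 120 requests per minute are assumptions, not anything from the question. A real deployment would use shared storage (e.g. Redis) or the web server's own rate-limiting modules:

```python
# Minimal per-IP rate-limiting sketch (Flask assumed; the limit is arbitrary).
# The in-memory dict only works for a single process; use shared storage in practice.
import time
from collections import defaultdict
from flask import Flask, request, abort

app = Flask(__name__)
WINDOW = 60          # seconds
LIMIT = 120          # max requests per window per IP (arbitrary choice)
hits = defaultdict(list)

@app.before_request
def throttle():
    now = time.time()
    ip = request.remote_addr
    hits[ip] = [t for t in hits[ip] if now - t < WINDOW]   # drop hits outside the window
    if len(hits[ip]) >= LIMIT:
        abort(429)                                          # Too Many Requests
    hits[ip].append(now)

@app.route("/")
def index():
    return "Hello"
```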

You will get a directory listing only if a) it is not forbidden in the server config and b) there is no default index file (on Apache, index.html or index.php by default).

In practical terms, it is a good idea not to make things easier for the scraper, so make sure your website's search function is properly sanitized (it shouldn't return all records on an empty query, and it should filter the % sign if you are using MySQL's LIKE syntax, as in the sketch below...). And of course use a CAPTCHA where appropriate, but it must be properly implemented, not a simple "what is 2 + 2" or a couple of letters in a common font on a plain background.
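
As a sketch of the LIKE-wildcard point, here is one way to sanitize a search term in Python before it reaches a MySQL LIKE query; the table and column names and the DB-API cursor are hypothetical:

```python
# Sketch: escape MySQL LIKE wildcards and reject empty search queries.
# The table/column names and the `cursor` object are hypothetical.
def escape_like(term):
    """Escape backslash, % and _ so user input is matched literally."""
    return term.replace("\\", "\\\\").replace("%", r"\%").replace("_", r"\_")

def build_search_pattern(user_input):
    term = user_input.strip()
    if not term:
        return None                      # don't return every record on an empty query
    return "%" + escape_like(term) + "%"

# pattern = build_search_pattern(query_string)
# if pattern is not None:
#     cursor.execute("SELECT id, title FROM articles WHERE title LIKE %s", (pattern,))
```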

Another protection against scraping might be referer checks that gate access to certain parts of the website (a rough sketch follows); however, it is better to simply forbid access on the server side to any parts of the website you don't want public (using .htaccess, for example).
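
A rough sketch of such a referer check, again assuming Flask; the /downloads/ prefix and the allowed host are hypothetical, and as noted above a server-side restriction is the stronger option:

```python
# Sketch of a referer check (Flask assumed; path prefix and host are hypothetical).
from urllib.parse import urlparse
from flask import Flask, request, abort

app = Flask(__name__)
ALLOWED_REFERER_HOST = "www.example.com"    # hypothetical: your own site

@app.before_request
def check_referer():
    # Only gate the parts of the site you don't want fetched directly.
    if request.path.startswith("/downloads/"):
        referer = request.headers.get("Referer", "")
        if urlparse(referer).netloc != ALLOWED_REFERER_HOST:
            abort(403)

@app.route("/downloads/report")
def report():
    return "gated content"
```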

Lastly, in my experience scrapers usually have only basic JS parsing capabilities, so implementing some kind of check in JavaScript could work; however, you would also be excluding all visitors with JS switched off (or using NoScript or a similar browser plugin) or with an outdated browser.

cyber-guard
  • "With couple of threads you will be able to check even thousands a minute" - if those URLs are all on the same site, that would amount to a denial of service attack. "Most scraping scripts will identify themselves as regular browsers" - I agree that some probably do, but we should be making it clear that it is _generally_ thought to be a bad behaviour for them to do so. – halfer Nov 08 '13 at 09:16
  • Regarding your first point: that very much depends on the website and the traffic it handles, so saying that making a couple of thousand requests a minute to a single website always amounts to a DoS is wrong (it doesn't take into account the number of links, the timing or the traffic). Regarding the second point: I can't agree either; I would say that if the website's ToS doesn't specifically prohibit scraping, it is perfectly fine to set the UA string to mimic a regular browser. I would certainly say it applies to search engine indexing bots, for the purpose of controlling what gets indexed, but that's about it. – cyber-guard Nov 09 '13 at 11:07
  • "A couple of thousand requests a minute to a single website" is unquestionably the consumption of a large amount of server resource, even if it does not bring the site down. We may have to agree to disagree, but at least people reading your answer will see my counterarguments. It is worth my pointing out though that I am not anti-scraping; in fact, my current project depends upon it. But scraping needs to be done with care, in relation to the unfair consumption of resources, the costs forced upon third parties when they are excessively scraped, and for issues around scraping personal data. – halfer Nov 09 '13 at 11:31
  • OK, on this I do agree: when scaling your scraper you must take into account the website you will be scraping and its resources (which I would say is common sense). That being said, say 2-3 thousand requests per minute against a large site with solid infrastructure, load balancing etc., like Facebook, could hardly amount to consuming a large amount of server resources, as it equals at most a couple of hundred users online... – cyber-guard Nov 10 '13 at 10:51
0

To fully "download" a site you need a web crawler, that in addition to follow the urls also saves their content. The application should be able to :

  • Parse the "root" URL
  • Identify all the links to other pages in the same domain
  • Access and download those pages, and all the ones linked from those child pages
  • Remember which links have already been parsed, in order to avoid loops

A search for "web crawler" should provide you with plenty of examples.
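
For illustration, here is a minimal sketch of such a crawler in Python, assuming the third-party requests and beautifulsoup4 packages; it follows same-domain links, keeps each page's content, and remembers visited URLs to avoid loops:

```python
# Minimal same-domain crawler sketch (requests + beautifulsoup4 assumed).
# No politeness delays, robots.txt handling or retries are shown here.
from collections import deque
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

def crawl(root_url, max_pages=100):
    domain = urlparse(root_url).netloc
    seen = {root_url}                      # remember parsed links to avoid loops
    queue = deque([root_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            continue
        pages[url] = resp.text             # "download": keep the page content
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

# pages = crawl("http://example.com")
```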

I don't know of countermeasures you could adopt to avoid this: in most cases you WANT bots to crawl your website, since that's how search engines learn about your site.

I suppose you could look at the traffic logs and, if you identify (by IP address) some repeat offenders, blacklist them to prevent access to the server.
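
A minimal sketch of that idea, counting requests per IP in an Apache-style (common/combined format) access log; the log path and the threshold are arbitrary assumptions:

```python
# Sketch: find "repeat offender" IPs in an Apache-style access log.
# The log path and the threshold of 1000 requests are arbitrary assumptions.
from collections import Counter

LOG_FILE = "/var/log/apache2/access.log"   # hypothetical path
THRESHOLD = 1000

counts = Counter()
with open(LOG_FILE) as log:
    for line in log:
        ip = line.split(" ", 1)[0]          # first field is the client IP
        counts[ip] += 1

for ip, n in counts.most_common():
    if n < THRESHOLD:
        break
    print(f"{ip} made {n} requests; consider blacklisting it (e.g. Deny from {ip})")
```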

vinaut
  • How do I get a list of directories under the root, if there is an index file already? – a23ziz Sep 28 '13 at 14:40
  • You will still see links to the files in the source HTML. It should be no different: you will have to parse the HTML to identify the links. – vinaut Sep 28 '13 at 14:42
  • What do you mean, I should see links in the HTML? I think I didn't phrase my question correctly, sorry about that. What I meant was: if I upload an index.php file to my root with just placeholder text, so all it serves is that, how do I get the directories then? – a23ziz Sep 28 '13 at 15:05
  • You don't (or at least shouldn't) get a listing. If you are accessing the root from the outside, you will see just the HTML page the server sends you. The directory structure of the server is hidden from the client. Any crawler works by following the links in the HTML; crawlers cannot access the server file system directly. – vinaut Sep 28 '13 at 15:29