
I have a home page with some links and mail IDs, and I need to stop the scraping of those URLs and mail IDs from that web page... I have used robots.txt, but most of the bad crawlers won't respect that...

– raj
`robots.txt` is only good for keeping out respectable crawlers, which is to say most search engines (though even Google admits to simulating a page visit as a human, ignoring robots.txt and falsifying the browser string). Obfuscating the content (with JS or encoded characters) may help; securing the page (requiring a login or a CAPTCHA entry first) could also help. – Rudu Sep 03 '10 at 13:08

3 Answers


Well, you can always try obfuscating your URLs with JavaScript or images or something. But please don't do that: you'll just anger people with old browsers and blind people who use screen readers. Just use a spam filter to stop people spamming your e-mail address.

If you have a content-heavy site and you want to stop people from scraping your content, you might try limiting visitors to ten hits every ten seconds. That'll be enough for most visitors, but it'll significantly decrease the speed of content scrapers. You can tweak this algorithm as you go, and ban the IPs of serious offenders.
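A minimal sketch of that kind of throttle, assuming a Node.js/Express app; the framework, the in-memory Map, and the ten-hits-per-ten-seconds numbers are all illustrative choices, not part of the answer:

```javascript
// Naive sliding-window rate limiter: at most MAX_HITS requests per
// WINDOW_MS per IP. A real deployment would use a store with expiry
// (e.g. Redis) instead of a Map that grows without bound.
const express = require('express');
const app = express();

const WINDOW_MS = 10 * 1000; // ten seconds
const MAX_HITS = 10;         // ten hits per window
const hits = new Map();      // ip -> timestamps of recent requests

app.use((req, res, next) => {
  const now = Date.now();
  const recent = (hits.get(req.ip) || []).filter(t => now - t < WINDOW_MS);
  recent.push(now);
  hits.set(req.ip, recent);
  if (recent.length > MAX_HITS) {
    return res.status(429).send('Too many requests, slow down.');
  }
  next();
});

app.get('/', (req, res) => res.send('Home page'));
app.listen(3000);
```

Banning serious offenders is then just a matter of keeping the counts around and moving repeat IPs onto a longer-lived blocklist.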

– Steve Rukuts

You could encode some links as HTML character entities, e.g. `&#102;&#111;&#111;&#64;&#98;&#97;&#114;&#46;&#99;&#111;&#109;` instead of `foo@bar.com`. Browsers render both identically, but scrapers that match plain-text email patterns in the source will miss the encoded form.
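A minimal sketch of that encoding in plain JavaScript (the helper name `entityEncode` is made up for illustration):

```javascript
// Turn every character into an HTML decimal entity, e.g. 'f' -> '&#102;'.
// The page renders exactly as before, but naive pattern-matching scrapers
// no longer see a literal email address in the source.
function entityEncode(text) {
  return [...text].map(ch => `&#${ch.codePointAt(0)};`).join('');
}

console.log(entityEncode('foo@bar.com'));
// => &#102;&#111;&#111;&#64;&#98;&#97;&#114;&#46;&#99;&#111;&#109;
```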

– Sjoerd

Use a honeypot link that is hidden from real users. Disallow its URL in robots.txt and add rel="nofollow" to the link so that respectable engines will never hit it. Hide the link with JavaScript when the page loads so legitimate users will not see or click it. Then temporarily block the IP or session of anything that does hit the link.
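A sketch of the whole trap, again assuming Express; the `/trap` path, the one-hour block, and the in-memory blocklist are illustrative assumptions, not the answer's prescription:

```javascript
// Honeypot trap: robots.txt should also carry "Disallow: /trap" so that
// well-behaved crawlers are told to stay away before they ever ask.
const express = require('express');
const app = express();

const blocked = new Map();       // ip -> time (ms) when the block expires
const BLOCK_MS = 60 * 60 * 1000; // temporary block: one hour (arbitrary)

// Turn away anything coming from a currently blocked IP.
app.use((req, res, next) => {
  const until = blocked.get(req.ip);
  if (until && Date.now() < until) return res.status(403).send('Forbidden');
  next();
});

// Only scrapers ever reach this: the link is nofollow'd for engines and
// hidden from humans by the script in the page below.
app.get('/trap', (req, res) => {
  blocked.set(req.ip, Date.now() + BLOCK_MS);
  res.status(403).send('Forbidden');
});

app.get('/', (req, res) => {
  res.send(
    '<a id="honeypot" href="/trap" rel="nofollow">do not follow</a>' +
    '<script>document.getElementById("honeypot").style.display="none";</script>' +
    '<p>Home page</p>'
  );
});

app.listen(3000);
```

Scrapers that don't execute JavaScript still see the raw link and follow it; humans never do, because the script hides it as soon as the page loads.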

– Robert Swisher