
I have a site in PHP. In recent weeks, my website has been getting a lot of automated hits from a single location. It indicates that someone is "poaching" the content in an automated manner instead of visiting the site through a proper browser. I suppose this is being done with tools/utilities like wget (or curl, or something similar).

Is there a way such automated access can be blocked?

In an attempt to investigate, I tried using wget on popular sites like Yahoo, US News and Bloomberg. The wget utility successfully downloaded the pages (HTML code) from Yahoo and US News. However, a similar attempt on a sample Bloomberg page failed.

Command I used:

wget64.exe https://www.bloomberg.com/research//stocks/snapshot/snapshot_article.asp?ticker=CWEN

The resulting file that was saved contained the following:

<h2 class="main__heading">We've detected unusual activity from your computer network</h2>

    <p class="continue">To continue, please click the box below to let us know you're not a robot.</p>
    <div id="px-captcha"></div>
</section>
<section class="box">
    <section class="info">
        <h3 class="info__heading">Why did this happen?</h3>
        <p class="info__text">Please make sure your browser supports JavaScript and cookies and that you are not blocking them from loading. For more information you can review our <a class="info__link" href="/notices/tos">Terms of Service</a> and <a class="info__link" href="/notices/tos">Cookie Policy</a>

This indicates that at least Bloomberg has a way to prevent such automated access. Does anyone know what a webmaster can implement to prevent this kind of automated access, as Bloomberg apparently has?

While I agree that access on the internet should be free, sometimes a few boundaries need to be implemented to prevent unauthorized access.

Aquaholic
  • Possible duplicate of [How do I prevent site scraping?](https://stackoverflow.com/questions/3161548/how-do-i-prevent-site-scraping) – Nico Haase Mar 06 '19 at 08:12

1 Answer


Wget can easily be blocked by adding the following to your .htaccess file.

RewriteEngine On
# Return 403 Forbidden for any request whose User-Agent contains "wget" (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} wget.* [NC]
RewriteRule .* - [F,L]
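
Since the question mentions the site is in PHP, a rough application-level equivalent of the same User-Agent check might look like the sketch below. This is only an illustration, assuming it runs before any output (for example at the top of a common include file); the agent strings to block are up to you.

    <?php
    // Minimal sketch: block requests whose User-Agent mentions wget or curl,
    // or that send no User-Agent header at all.
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

    if ($ua === '' || stripos($ua, 'wget') !== false || stripos($ua, 'curl') !== false) {
        http_response_code(403); // 403 Forbidden
        exit('Automated access is not allowed.');
    }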

However, if the User-Agent string is changed, you may never know that it is Wget.
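
For example, wget's --user-agent option lets a client pretend to be a regular browser, in which case the rule above no longer matches (the URL here is just a placeholder):

    wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)" https://www.example.com/page.html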

You may also look at how to keep robots out with a robots.txt file: http://www.robotstxt.org/
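
Note that robots.txt is purely advisory: well-behaved crawlers honor it, but a scraper can simply ignore it. A minimal sketch (the path and agent names are only illustrative):

    # Ask all crawlers to stay out of /private/
    User-agent: *
    Disallow: /private/

    # Ask Wget specifically to stay away entirely
    User-agent: Wget
    Disallow: /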

This tutorial explains the .htaccess approach in more detail: http://www.javascriptkit.com/howto/htaccess13.shtml

Andrei Lupuleasa
  • Thanks @AndreiLupuleasa for the inputs. Playing around with .htaccess is tricky, and I'm not that conversant with it, so it would help if you could elaborate on what the above-mentioned code actually does. – Aquaholic Mar 06 '19 at 08:05
  • 3
    You can look on this tutorial http://www.javascriptkit.com/howto/htaccess13.shtml, they explain everything. – Andrei Lupuleasa Mar 06 '19 at 08:08