Some other websites use cURL and a fake HTTP referer to copy my website's content. Is there any way to detect cURL, or requests that do not come from a real web browser?
-
You can't assume scrapers can't run Javascript... there's stuff like Rhino out there that will allow them to scrape your site while running Javascript. Unless you place your content behind a digital wall (login, authentication, etc...), it'll be available to be scraped. Copyright the material and then sue them if they post it without written permission. If they are in another country, best of luck. – Yzmir Ramirez Sep 12 '12 at 02:48
-
I don't know if this will change in the future, but cURL (at least PHP cURL) ignores the `Connection: close` HTTP response header. Deducing from this, your best bet would be detecting non-standard HTTP clients (browsers usually respect most RFC standards when it comes to headers). Another trick would be a javascript snippet to detect keyboard, mouse and scroll events which then phones home and "validates" the current session. You can even display a dialog to the current user :). A robot will never generate a click event for it, especially if you position it randomly. – oxygen Sep 12 '12 at 15:23
-
@Tiberiu-IonuțStan: that is factually incorrect. libcurl (and thus PHP/CURL too) does not ignore a "Connection: close" header. See lib/http.c in the libcurl source code. – Daniel Stenberg Sep 14 '12 at 08:25
-
@DanielStenberg I'm just talking from experience, I won't look at the source code. – oxygen Sep 14 '12 at 08:31
-
... and I talk as the libcurl author who wrote that code. – Daniel Stenberg Sep 14 '12 at 08:40
-
@DanielStenberg I wrote a little script to test the current cURL in PHP, and you are correct. Sorry :) – oxygen Sep 15 '12 at 10:16
6 Answers
There is no magic solution to prevent automatic crawling. Everything a human can do, a robot can do too. There are only solutions that make the job harder, so hard that only strongly skilled geeks will try to get past them.
I was in trouble too some years ago, and my first advice is: if you have time, become a crawler yourself (I assume a "crawler" here is the person who crawls your website); this is the best school for the subject. By crawling several websites, I learned different kinds of protections, and by combining them I became efficient.
Here are some examples of protections you may try.
Sessions per IP
If a user opens 50 new sessions per minute, you can suspect this user is a crawler that does not handle cookies. Of course, curl manages cookies perfectly, but if you couple this with a visit counter per session (explained later), or if your crawler is a newbie with cookie handling, it may be effective.
It is hard to imagine that 50 people behind the same shared connection will browse your website simultaneously (it of course depends on your traffic; that is up to you). And if this happens, you can lock the pages of your website until a captcha is solved.
Idea :
1) create 2 tables: one to store banned IPs and one to store IP/session pairs
create table if not exists sessions_per_ip (
ip int unsigned,
session_id varchar(32),
creation timestamp default current_timestamp,
primary key(ip, session_id)
);
create table if not exists banned_ips (
ip int unsigned,
creation timestamp default current_timestamp,
primary key(ip)
);
2) at the beginning of your script, delete entries that are too old from both tables
3) next, check whether your user's IP is banned (if so, set a flag to true)
4) if not, count how many sessions exist for his IP
5) if he has too many sessions, insert his IP into your banned table and set the flag
6) insert his IP/session pair into the sessions-per-IP table if it is not already there
I wrote a code sample to illustrate the idea.
<?php
try
{
// Some configuration (small values for demo)
$max_sessions = 5; // 5 sessions/ip simultaneously allowed
$check_duration = 30; // 30 secs max lifetime of an ip on the sessions_per_ip table
$lock_duration = 60; // time to lock your website for this ip if max_sessions is reached
// Mysql connection
require_once("config.php");
$dbh = new PDO("mysql:host={$host};dbname={$base}", $user, $password);
$dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// Delete old entries in tables
$query = "delete from sessions_per_ip where timestampdiff(second, creation, now()) > {$check_duration}";
$dbh->exec($query);
$query = "delete from banned_ips where timestampdiff(second, creation, now()) > {$lock_duration}";
$dbh->exec($query);
// Get useful info attached to our user...
session_start();
$ip = ip2long($_SERVER['REMOTE_ADDR']);
$session_id = session_id();
// Check if IP is already banned
$banned = false;
$count = $dbh->query("select count(*) from banned_ips where ip = '{$ip}'")->fetchColumn();
if ($count > 0)
{
$banned = true;
}
else
{
// Count entries in our db for this ip
$query = "select count(*) from sessions_per_ip where ip = '{$ip}'";
$count = $dbh->query($query)->fetchColumn();
if ($count >= $max_sessions)
{
// Lock website for this ip
$query = "insert ignore into banned_ips ( ip ) values ( '{$ip}' )";
$dbh->exec($query);
$banned = true;
}
// Insert a new entry on our db if user's session is not already recorded
$query = "insert ignore into sessions_per_ip ( ip, session_id ) values ('{$ip}', '{$session_id}')";
$dbh->exec($query);
}
// At this point, $banned tells you whether your user is banned or not.
// The following code will allow us to test it...
// We do not display anything now because we'll play with sessions :
// to make the demo more readable I prefer going step by step like
// this.
ob_start();
// Displays your current sessions
echo "Your current sessions keys are : <br/>";
$query = "select session_id from sessions_per_ip where ip = '{$ip}'";
foreach ($dbh->query($query) as $row) {
echo "{$row['session_id']}<br/>";
}
// Display and handle a way to create new sessions
echo str_repeat('<br/>', 2);
echo '<a href="' . basename(__FILE__) . '?new=1">Create a new session / reload</a>';
if (isset($_GET['new']))
{
session_regenerate_id();
session_destroy();
header("Location: " . basename(__FILE__));
die();
}
// Display if you're banned or not
echo str_repeat('<br/>', 2);
if ($banned)
{
echo '<span style="color:red;">You are banned: wait 60secs to be unbanned... a captcha must be more friendly of course!</span>';
echo '<br/>';
echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
}
else
{
echo '<span style="color:blue;">You are not banned!</span>';
echo '<br/>';
echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
}
ob_end_flush();
}
catch (PDOException $e)
{
/*echo*/ $e->getMessage();
}
?>
Visit Counter
If your user uses the same cookie while crawling your pages, you'll be able to use his session to block him. The idea is quite simple: is it plausible that a real user visits 60 pages in 60 seconds?
Idea :
- Create an array in the user session; it will contain visit timestamps (time())
- Remove visits older than X seconds from this array
- Add a new entry for the current visit
- Count the entries in this array
- Ban your user if he has visited more than Y pages
Sample code :
<?php
$visit_counter_pages = 5; // maximum number of pages to load
$visit_counter_secs = 10; // maximum amount of time before cleaning visits
session_start();
// initialize an array for our visit counter
if (array_key_exists('visit_counter', $_SESSION) == false)
{
$_SESSION['visit_counter'] = array();
}
// clean old visits
foreach ($_SESSION['visit_counter'] as $key => $time)
{
if ((time() - $time) > $visit_counter_secs) {
unset($_SESSION['visit_counter'][$key]);
}
}
// we add the current visit into our array
$_SESSION['visit_counter'][] = time();
// check if user has reached limit of visited pages
$banned = false;
if (count($_SESSION['visit_counter']) > $visit_counter_pages)
{
// puts ip of our user on the same "banned table" as earlier...
$banned = true;
}
// At this point, $banned tells you whether your user is banned or not.
// The following code will allow us to test it...
echo '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js"></script>';
// Display counter
$count = count($_SESSION['visit_counter']);
echo "You visited {$count} pages.";
echo str_repeat('<br/>', 2);
echo <<< EOT
<a id="reload" href="#">Reload</a>
<script type="text/javascript">
$('#reload').click(function(e) {
e.preventDefault();
window.location.reload();
});
</script>
EOT;
echo str_repeat('<br/>', 2);
// Display if you're banned or not
echo str_repeat('<br/>', 2);
if ($banned)
{
echo '<span style="color:red;">You are banned! Wait for a short while (10 secs in this demo)...</span>';
echo '<br/>';
echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
}
else
{
echo '<span style="color:blue;">You are not banned!</span>';
echo '<br/>';
echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
}
?>
An image to download
When a crawler needs to do its dirty work, it is for a large amount of data, in the shortest possible time. That's why crawlers usually don't download the images on your pages; it takes too much bandwidth and makes the crawling slower.
This idea (in my opinion the most elegant and the easiest to implement) uses mod_rewrite to hide code behind a .jpg/.png/… image URL. This image should be included on each page you want to protect: it could be your website logo, but you should choose a small image (because this image must not be cached).
Idea :
1/ Add those lines to your .htaccess
RewriteEngine On
RewriteBase /tests/anticrawl/
RewriteRule ^logo\.jpg$ logo.php
2/ Create your logo.php with the security check
<?php
// start session and reset counter
session_start();
$_SESSION['no_logo_count'] = 0;
// forces image to reload next time
header("Cache-Control: no-store, no-cache, must-revalidate");
// displays image
header("Content-type: image/jpg");
readfile("logo.jpg");
die();
3/ Increment your no_logo_count on each page you need to protect, and check whether it has reached your limit.
Sample code :
<?php
$no_logo_limit = 5; // number of allowed pages without the logo
// start session and initialize
session_start();
if (array_key_exists('no_logo_count', $_SESSION) == false)
{
$_SESSION['no_logo_count'] = 0;
}
else
{
$_SESSION['no_logo_count']++;
}
// check if user has reached limit of "undownloaded image"
$banned = false;
if ($_SESSION['no_logo_count'] >= $no_logo_limit)
{
// puts ip of our user on the same "banned table" as earlier...
$banned = true;
}
// At this point, $banned tells you whether your user is banned or not.
// The following code will allow us to test it...
echo '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js"></script>';
// Display counter
echo "You did not loaded image {$_SESSION['no_logo_count']} times.";
echo str_repeat('<br/>', 2);
// Display "reload" link
echo <<< EOT
<a id="reload" href="#">Reload</a>
<script type="text/javascript">
$('#reload').click(function(e) {
e.preventDefault();
window.location.reload();
});
</script>
EOT;
echo str_repeat('<br/>', 2);
// Display "show image" link : note that we're using .jpg file
echo <<< EOT
<div id="image_container">
<a id="image_load" href="#">Load image</a>
</div>
<br/>
<script type="text/javascript">
// On your implementation, you'll of course use <img src="logo.jpg" />
$('#image_load').click(function(e) {
e.preventDefault();
$('#image_load').html('<img src="logo.jpg" />');
});
</script>
EOT;
// Display if you're banned or not
echo str_repeat('<br/>', 2);
if ($banned)
{
echo '<span style="color:red;">You are banned: click on "load image" and reload...</span>';
echo '<br/>';
echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
}
else
{
echo '<span style="color:blue;">You are not banned!</span>';
echo '<br/>';
echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
}
?>
Cookie check
You can create cookies on the JavaScript side to check whether your users interpret JavaScript (a crawler using cURL does not, for example).
The idea is quite simple: it is about the same as the image check.
- Set a $_SESSION counter and increment it on each visit
- if a cookie (set in JavaScript) exists, reset the session value to 0
- if this value reaches a limit, ban your user
Code :
<?php
$no_cookie_limit = 5; // number of allowed pages without the cookie check
// Start session and reset counter
session_start();
if (array_key_exists('cookie_check_count', $_SESSION) == false)
{
$_SESSION['cookie_check_count'] = 0;
}
// Initializes the cookie (note: rename it to a more discreet name of course) or checks the cookie value
if ((array_key_exists('cookie_check', $_COOKIE) == false) || ($_COOKIE['cookie_check'] != 42))
{
// Cookie does not exist or is incorrect...
$_SESSION['cookie_check_count']++;
}
else
{
// Cookie is properly set so we reset counter
$_SESSION['cookie_check_count'] = 0;
}
// Check if user has reached limit of "cookie check"
$banned = false;
if ($_SESSION['cookie_check_count'] >= $no_cookie_limit)
{
// puts ip of our user on the same "banned table" as earlier...
$banned = true;
}
// At this point, $banned tells you whether your user is banned or not.
// The following code will allow us to test it...
echo '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js"></script>';
// Display counter
echo "Cookie check failed {$_SESSION['cookie_check_count']} times.";
echo str_repeat('<br/>', 2);
// Display "reload" link
echo <<< EOT
<br/>
<a id="reload" href="#">Reload</a>
<br/>
<script type="text/javascript">
$('#reload').click(function(e) {
e.preventDefault();
window.location.reload();
});
</script>
EOT;
// Display "set cookie" link
echo <<< EOT
<br/>
<a id="cookie_link" href="#">Set cookie</a>
<br/>
<script type="text/javascript">
// On your implementation, you'll of course set the cookie in a $(document).ready()
$('#cookie_link').click(function(e) {
e.preventDefault();
var expires = new Date();
expires.setTime(new Date().getTime() + 3600000);
document.cookie="cookie_check=42;expires=" + expires.toGMTString();
});
</script>
EOT;
// Display "unset cookie" link
echo <<< EOT
<br/>
<a id="unset_cookie" href="#">Unset cookie</a>
<br/>
<script type="text/javascript">
// This "unset cookie" link exists only for the demo: it deletes the cookie so you can trigger the ban
$('#unset_cookie').click(function(e) {
e.preventDefault();
document.cookie="cookie_check=;expires=Thu, 01 Jan 1970 00:00:01 GMT";
});
</script>
EOT;
// Display if you're banned or not
echo str_repeat('<br/>', 2);
if ($banned)
{
echo '<span style="color:red;">You are banned: click on "Set cookie" and reload...</span>';
echo '<br/>';
echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
}
else
{
echo '<span style="color:blue;">You are not banned!</span>';
echo '<br/>';
echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
}
Protection against proxies
A few words about the different kinds of proxies you may find on the web:
- A "normal" proxy exposes information about the user's connection (notably, his IP)
- An anonymous proxy does not expose the IP, but indicates the proxy usage in its headers.
- A high-anonymity proxy does not expose the user's IP, and does not send any information a browser would not send.
It is easy to find a proxy to connect to any website, but it is very hard to find high-anonymity proxies.
Some $_SERVER keys may be present specifically when your user is behind a proxy (list taken from this question):
- CLIENT_IP
- FORWARDED
- FORWARDED_FOR
- FORWARDED_FOR_IP
- HTTP_CLIENT_IP
- HTTP_FORWARDED
- HTTP_FORWARDED_FOR
- HTTP_FORWARDED_FOR_IP
- HTTP_PC_REMOTE_ADDR
- HTTP_PROXY_CONNECTION
- HTTP_VIA
- HTTP_X_FORWARDED
- HTTP_X_FORWARDED_FOR
- HTTP_X_FORWARDED_FOR_IP
- HTTP_X_IMFORWARDS
- HTTP_XROXY_CONNECTION
- VIA
- X_FORWARDED
- X_FORWARDED_FOR
You may apply a different behavior (lower limits, etc.) in your anti-crawl protections if you detect one of those keys in your $_SERVER variable, as sketched below.
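As an illustration, here is a minimal sketch of such a check; the helper name and the lowered limits are my own choices, and the key list is the one above:
<?php
// Hypothetical helper: returns true if any proxy-related key is present in $_SERVER.
function looks_like_proxy(array $server)
{
    $proxy_keys = array(
        'CLIENT_IP', 'FORWARDED', 'FORWARDED_FOR', 'FORWARDED_FOR_IP',
        'HTTP_CLIENT_IP', 'HTTP_FORWARDED', 'HTTP_FORWARDED_FOR',
        'HTTP_FORWARDED_FOR_IP', 'HTTP_PC_REMOTE_ADDR', 'HTTP_PROXY_CONNECTION',
        'HTTP_VIA', 'HTTP_X_FORWARDED', 'HTTP_X_FORWARDED_FOR',
        'HTTP_X_FORWARDED_FOR_IP', 'HTTP_X_IMFORWARDS', 'HTTP_XROXY_CONNECTION',
        'VIA', 'X_FORWARDED', 'X_FORWARDED_FOR',
    );
    foreach ($proxy_keys as $key) {
        if (array_key_exists($key, $server)) {
            return true;
        }
    }
    return false;
}
// Example: tighten the limits used in the snippets above when a proxy is suspected.
$max_sessions        = looks_like_proxy($_SERVER) ? 2 : 5;
$visit_counter_pages = looks_like_proxy($_SERVER) ? 3 : 5;
?>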
Conclusion
There are a lot of ways to detect abuse on your website, so you will find a solution for sure. But you need to know precisely how your website is used, so that your protections are not aggressive towards your "normal" users.

-
I like that you set the cookie in JavaScript, and then use PHP to read it, and BAN. – Ken Le Sep 15 '12 at 15:33
-
Pinterest.com tried all of this, and I still managed to do it ;) -- code http://stackoverflow.com/questions/28629343/why-isnt-curl-logging-into-external-website/28660511#28660511 – hanshenrik Aug 20 '15 at 12:45
-
All hardcore geeks can pass any protection, even captchas :-). I think the most efficient protections are the ones inside images and encoded javascript. – Alain Tiemblo Aug 20 '15 at 12:48
-
thank you for providing such an extensive answer. I would highly recommend that you look into escaping the parameters that are sent to the queries or use prepared statements to avoid SQL injections. – Svetoslav Marinov Nov 15 '22 at 15:15
Remember: HTTP is not magic. There is a defined set of headers sent with each HTTP request; if these headers can be sent by a web browser, they can just as well be sent by any program - including cURL (and libcurl).
Some consider it a curse, but on the other hand, it's a blessing, as it greatly simplifies functional testing of web applications.
UPDATE: As unr3al011 rightly noticed, curl doesn't execute JavaScript, so in theory it's possible to create a page that behaves differently when viewed by grabbers (for example, by setting, and later checking, a specific cookie via JavaScript).
Still, it'd be a very fragile defense. The page's data still has to be fetched from the server - and that HTTP request (and it's always an HTTP request) can be emulated with curl. Check this answer for an example of how to defeat such a defense.
... and I didn't even mention that some grabbers are able to execute JavaScript. )
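For illustration only, a rough sketch of such emulation with PHP/cURL (the target URL is hypothetical, and cookie_check=42 is the value expected by the cookie-check demo in the answer above):
<?php
// A grabber does not need to run the JavaScript: it can simply send the cookie
// the script would have set, plus browser-like headers.
$ch = curl_init('http://example.com/protected-page.php'); // hypothetical target
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)');
curl_setopt($ch, CURLOPT_REFERER, 'http://example.com/');
curl_setopt($ch, CURLOPT_COOKIE, 'cookie_check=42');      // value the JS check expects
curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');  // keep the PHP session cookie
curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/cookies.txt');
$html = curl_exec($ch);
curl_close($ch);
?>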
-
right now cURL can set the user-agent and the http referer. So, we can't detect it at all? – Ken Le Sep 04 '12 at 06:09
-
No. I'd say 'unfortunately', but then again, it's actually not: if curl were not able to send it, any other library would have taken its place. – raina77ow Sep 04 '12 at 06:10
-
You "can" detect curl. If you assume that a request is from curl, check if it can execute Javascript. Curl cannot execute Javascript. – pila Sep 05 '12 at 08:14
-
@KenLe Did you check the answer I've mentioned? It contains both HTML with JS checking - and PHP code to defeat it. I didn't see much point in including this code here as well, as it, in general, is not a solution. – raina77ow Sep 06 '12 at 07:07
You can detect the cURL user agent with the following method. But be warned: the user agent can be overridden by the user; only the default setting can be recognized:
function is_curl() {
    if (stristr($_SERVER["HTTP_USER_AGENT"], 'curl')) {
        return true;
    }
    return false;
}
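For example, a minimal usage sketch (rejecting with a 403 is just one possible reaction):
<?php
if (is_curl()) {
    // The default cURL user agent was detected: refuse to serve the page.
    header('HTTP/1.1 403 Forbidden');
    exit;
}
?>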

The way to avoid fake referers is to track the user.
You can track the user with one or more of these methods:
Save a cookie in the client browser with some special code (e.g. last URL visited, a timestamp) and verify it on each request to your server.
Same as before but using sessions instead of explicit cookies
For cookies you should add cryptographic security, like:
[Cookie]
url => http://someurl/
hash => dsafdshfdslajfd
The hash is calculated in PHP this way:
$url = $_COOKIE['url'];
$hash = $_COOKIE['hash'];
$secret = 'This is a fixed secret in the code of your application';
// 'sha256' is just an example; any algorithm supported by hash() will do
$isValidCookie = (hash('sha256', $secret . $url) === $hash);
$isValidReferer = $isValidCookie && ($_SERVER['HTTP_REFERER'] === $url);
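The setting side is not shown here; a minimal sketch of it (my own addition, reusing the same secret, hash construction and cookie names as above) could be:
<?php
// Issue the signed cookie when serving a page.
$secret = 'This is a fixed secret in the code of your application';
$url    = 'http://someurl/';                 // e.g. the URL being served
$hash   = hash('sha256', $secret . $url);    // same construction as the check above
setcookie('url', $url);
setcookie('hash', $hash);
?>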

-
This is a stub with the basics, you should improve it to fit your own needs – Maks3w Sep 13 '12 at 07:50
As some have mentioned, cURL cannot execute JavaScript (to my knowledge), so you could possibly try setting something up like raina77ow suggests, but that would not work for other grabbers/downloaders.
I suggest you try building a bot trap; that way you can deal with the grabbers/downloaders that can execute JavaScript.
I don't know of any 1 solution to fully prevent this, so my best recommendation would be to try multiple solutions:
1) Only allow known user agents, such as all mainstream browsers, in your .htaccess file
2) Set up your robots.txt to prevent bots
3) Set up a bot trap for bots that do not respect the robots.txt file (a sketch follows below)
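A possible bot-trap sketch for point 3 (the file names are my own; the banned_ips table and config.php are the ones from the first answer):
<?php
// trap.php -- ban anything that requests this URL.
//
// robots.txt contains:   User-agent: *
//                        Disallow: /trap.php
// Every page contains a link humans never see:
//                        <a href="/trap.php" style="display:none">do not follow</a>
// Only a bot that ignores robots.txt will end up here.
require_once("config.php");
$dbh = new PDO("mysql:host={$host};dbname={$base}", $user, $password);
$stmt = $dbh->prepare("insert ignore into banned_ips ( ip ) values ( ? )");
$stmt->execute(array(ip2long($_SERVER['REMOTE_ADDR'])));
header("HTTP/1.1 403 Forbidden");
exit;
?>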

-
What I mean by point 1 is to only allow known user agents, such as all mainstream browsers; that way, if someone is using a grabber with a different user agent, they will be denied. – Rayvyn Sep 12 '12 at 17:59
-
Concerning 1): On top of that, it is very easy to change the User-Agent that cURL 'broadcasts'; see also the comments under the answer of Marcel Gent Simonis – Benjamin Seiller Sep 13 '12 at 07:42
-
Careful not to block search engines.. their bots can easily change useragent – mowgli Aug 23 '14 at 10:42
Put this into your root folder as a .htaccess file. It may help. I found it on one web-hosting provider's site but I don't really know what it means :)
SetEnvIf User-Agent ^Teleport graber
SetEnvIf User-Agent ^w3m graber
SetEnvIf User-Agent ^Offline graber
SetEnvIf User-Agent Downloader graber
SetEnvIf User-Agent snake graber
SetEnvIf User-Agent Xenu graber
Deny from env=graber

-
are you sure you should post solutions that you don't know what they mean? – eis Sep 05 '12 at 08:08
-
you can find out what that means by yourself, I've just copied this one. They say it will be difficult to grab your web page with this. I think it is a restriction for some grabbers, as you can see. – GentSVK Sep 05 '12 at 08:19
-
what it does is take the specified user-agent string parts, set the "graber" env variable for them, and deny access. If none of those appear in the user agent, this does nothing. – eis Sep 05 '12 at 09:03