3

I don't want people to use a script to easily fetch all the content of my site. With PHP cURL I can get all the text and data from my own site, but I have seen some sites that return only garbage text. For example, for this Chinese site: 'www.jjwxc.net/onebook.php?novelid=6971&chapterid=6', if I use the following PHP:

    // URL of the page mentioned above
    $url = 'http://www.jjwxc.net/onebook.php?novelid=6971&chapterid=6';

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);

    $headers = array();
    $headers[] = "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png, image/gif, image/x-bitmap, image/jpeg, image/pjpeg, *;q=0.5";
    $headers[] = "Cache-Control: max-age=0";
    $headers[] = "Connection: keep-alive";
    $headers[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $headers[] = "Accept-Language: en-us,en;q=0.5";
    $headers[] = "Pragma: ";
    $headers[] = 'Content-type: application/x-www-form-urlencoded;charset=UTF-8';
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);

    curl_setopt($ch, CURLOPT_ENCODING, '');          // accept any encoding curl supports (gzip, deflate, ...)
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);

    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 4);
    curl_setopt($ch, CURLOPT_TIMEOUT, 8);

    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.12) Gecko/2009070611 Firefox/3.0.12");

    $data = curl_exec($ch);
    curl_close($ch);

    echo $data;

I only get garbage text. But in a browser, even with JavaScript disabled, I can see all the characters correctly. Any idea how they do it? Thanks!

user2335065

2 Answers

4

That site serves its pages gzip-compressed. The browser transparently decompresses the response, while with lower-level tools like curl you have to handle the decompression yourself.
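If you are fetching such a site yourself, curl can do the decompression for you: passing an empty string to `CURLOPT_ENCODING` (as the code in the question already does) makes libcurl send an `Accept-Encoding` header and decode the response transparently. A minimal sketch:

    // Let curl negotiate and decode gzip/deflate automatically.
    $ch = curl_init('http://www.jjwxc.net/onebook.php?novelid=6971&chapterid=6');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_ENCODING, '');   // empty string = accept all encodings curl supports
    $html = curl_exec($ch);
    curl_close($ch);

    // If you ever receive a raw gzip body (e.g. CURLOPT_ENCODING not set),
    // it can be decompressed manually:
    // $html = gzdecode($rawBody);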

There's ultimately no way to distinguish between curl and a regular browser. Both simply make HTTP requests, and your server answers HTTP requests. You could look at the User-Agent header, which will either be absent or say "curl" for a default curl request; but it's trivial to add any and all headers a regular browser sends by default, which makes an HTTP request made with curl absolutely indistinguishable from one made by a browser.
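For completeness, a naive server-side User-Agent check might look like the sketch below; as explained above, it only stops clients that don't bother to set a browser-like User-Agent, so it is trivially bypassed:

    // Naive sketch: reject requests whose User-Agent is missing or mentions curl.
    // Easily bypassed by sending a browser-like User-Agent, as the code in the question does.
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    if ($ua === '' || stripos($ua, 'curl') !== false) {
        header('HTTP/1.1 403 Forbidden');
        exit;
    }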

What you want is simply not possible. If the information is public, it's public. You cannot dictate who gets to see it and who doesn't.

The only way to slow down mass scraping is to track requests by IP address and throttle IPs that make an unusually large number of requests. But even then, a small array of proxy servers can easily work around it.
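A very rough per-IP throttling sketch, assuming the APCu extension is available as a shared counter store (any cache or database would work the same way):

    // Rough sketch: allow at most $limit requests per IP per $ttl-second window.
    $ip    = $_SERVER['REMOTE_ADDR'];
    $key   = 'hits_' . $ip;
    $limit = 60;    // allowed requests per window
    $ttl   = 60;    // window length in seconds

    $hits = apcu_fetch($key);
    if ($hits === false) {
        apcu_store($key, 1, $ttl);                    // first request in this window
    } elseif ($hits >= $limit) {
        header('HTTP/1.1 429 Too Many Requests');     // over the limit: throttle
        exit;
    } else {
        apcu_inc($key);                               // count this request
    }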

deceze
  • Great. But how can I decompress it? Using `curl_setopt($ch, CURLOPT_ENCODING, "gzip");`? I couldn't get it to work with that line. It's enough for me if I can replicate what they have done and make it more difficult for people to fetch what they want than it is now. The site is renowned in China for its protection against plagiarism. Thank you! – user2335065 Sep 09 '15 at 14:54
  • 1
    What kind of "garbage" are you getting back exactly? Perhaps you additionally have problems handling the GB2312 *charset* correctly? – deceze Sep 09 '15 at 14:59
  • I tried returning `mb_convert_encoding($data, "UTF-8", "GB2312");` and now I can get the text. But I still cannot manipulate the DOM like I normally do on other sites: `$dom = new DomDocument(); $dom->loadHTML($html); $finder = new DomXPath($dom); $results = $finder->query("//*[@class='" . $classname . "']");` I cannot get `$results` the way I do with other sites (see the sketch after these comments). – user2335065 Sep 09 '15 at 15:20
  • If you can get the plain text and HTML, then the site is not doing anything magical to prevent you from "working" with it! You have some other specific problem. Open a new question if you need that solved. This question has run its course. – deceze Sep 09 '15 at 15:34
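Tying the comments above together, here is a sketch of fetching the page, converting it from GB2312, and querying it with DOMXPath. The class name is a placeholder, and the extra `HTML-ENTITIES` conversion is a common workaround so that `loadHTML()`'s charset guessing (which defaults to ISO-8859-1) cannot mangle the non-ASCII text:

    // Sketch based on the comments above; $classname is a placeholder.
    $ch = curl_init('http://www.jjwxc.net/onebook.php?novelid=6971&chapterid=6');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_ENCODING, '');                 // handle gzip transparently
    $raw = curl_exec($ch);
    curl_close($ch);

    // Convert GB2312 to UTF-8, as in the comment above...
    $utf8 = mb_convert_encoding($raw, 'UTF-8', 'GB2312');
    // ...then turn non-ASCII characters into numeric entities so that
    // DOMDocument's charset detection cannot misinterpret them.
    $html = mb_convert_encoding($utf8, 'HTML-ENTITIES', 'UTF-8');

    $dom = new DomDocument();
    libxml_use_internal_errors(true);                       // real-world HTML is rarely valid
    $dom->loadHTML($html);
    libxml_clear_errors();

    $finder    = new DomXPath($dom);
    $classname = 'sometext';                                // placeholder: use the real class name
    $results   = $finder->query("//*[@class='" . $classname . "']");   // matches the exact @class value only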
1

An answer to the question "how to detect crawlers and cURL" has been given here: https://stackoverflow.com/a/12401278/2761700

The techniques described there can be used to detect crawlers that disguise themselves with a fake USERAGENT, without too much risk of blocking real users.
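One common technique along those lines (not necessarily the exact one in the linked answer) is to verify a client that claims to be a well-known bot via reverse and forward DNS, for example for Googlebot:

    // Hedged sketch: verify a client claiming to be Googlebot with reverse + forward DNS.
    function looksLikeRealGooglebot($ip)
    {
        $host = gethostbyaddr($ip);                               // reverse DNS lookup
        if (!$host || !preg_match('/\.(googlebot|google)\.com$/i', $host)) {
            return false;                                         // not a Google hostname
        }
        return gethostbyname($host) === $ip;                      // forward lookup must match
    }

    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    if (stripos($ua, 'Googlebot') !== false && !looksLikeRealGooglebot($_SERVER['REMOTE_ADDR'])) {
        header('HTTP/1.1 403 Forbidden');                         // the user agent is faked
        exit;
    }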

scandel