How to prevent crawlers depending on XPath from getting pages contents

Question

There is a PHP library (something like cURL) that makes everybody able to attack me. I have an idea to prevent this: I want to use dynamic class names for my elements. Look at this:

<div class="<?php echo $ClassName; ?>">anything</div> // $ClassName is taken from the database

Note: $ClassName will vary every time.

In this case, nobody knows what my class name is, so nobody can select my element and copy my data. Now I have two problems:

How can I link $ClassName to .$ClassName (in the CSS file)? In other words, how can I use a PHP variable for CSS class names (dynamic CSS classes)?
Is it efficient to take all class names from the database?

Are you trying to prevent the theft of your intellectual property? — , May 20 '15 at 23:08
@Dagon Yeah! I have some valuable databases, and I don't know how I should protect them. — , May 20 '15 at 23:10
You could limit it by requiring users login before they can access the content and/or charge them. This will make it harder for your content to be indexed in search engines though. Also note this isn't considered an "attack" this is crawling/ scraping. — chris85, May 20 '15 at 23:12
If you're trying to prevent your publicly available web content from being scraped then you might want to re-think whether it should be publicly available web content — scrowler, May 20 '15 at 23:14
@chris85 I don't want to limit users, because this would reduce the popularity of my website. If I can implement my idea, I can increase my security. — , May 20 '15 at 23:18
Within a day someone will post how to scrape a site with 'dynamic' class names. There are a lot of scraping questions posted here (unfortunately). — , May 20 '15 at 23:21
Consider Google Translate: what does it do? Why can't I steal any words from it? How can I do that? Does anybody know? — , May 20 '15 at 23:24
I can :-) and so can anyone with a little knowledge. You can only make it harder, never impossible. To my mind that makes it a pointless approach. — , May 20 '15 at 23:25
Really?! But I am a professional thief and couldn't! Who are you? :) — , May 20 '15 at 23:26
@Dagon I have a question. When you want to steal data from a website, you need to select the element, right? And you select it via the class name, right? Now imagine the class name constantly changes. In that case you will not be able to steal it, right? — , May 20 '15 at 23:31
There is nothing stopping anyone from scraping the whole page and working from there. Sure, it's a little harder, but definitely doable. If your HTML is structured in a predictable way this would still be trivial to do, without the need for classes or IDs. Public is public and there is not much you can do about it. — Jon P, May 21 '15 at 00:15

score 4 · Answer 1 · edited Jun 20 '20 at 09:12

Define your class in CSS in your page:

<style>
    .<?php echo $ClassName; ?> {
      /* Your CSS */
    }
</style>

Just make $ClassName a randomly generated string; you don't need to connect to the database.
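As a minimal sketch of what "randomly generated" could look like (the helper name randomClassName and the "c" prefix are my own assumptions, not part of the answer), the class name can be built from random bytes on every request. The prefix is there because a CSS class selector should not start with a digit.

```php
<?php
// Hypothetical helper (not from the answer): build a fresh class name
// for each request.
function randomClassName(int $bytes = 6): string
{
    // Prefix a letter so the name never starts with a digit,
    // then append the random bytes as hexadecimal characters.
    return 'c' . bin2hex(random_bytes($bytes));
}

$ClassName = randomClassName();
// $ClassName is now "c" followed by 12 hex characters.
```

The same $ClassName value then has to be echoed both in the <style> block and in the element's class attribute during that request.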

Update

Building on bishop's answer, you can also give your document a changeable DOM structure. Introduce two PHP variables, such as $start and $close. $start holds a random sequence of opening tags, such as <span><div><p>, and $close holds the matching closing tags, </p></div></span>. Then enclose your content between them:

<?php echo $start; ?><div class="<?php echo $ClassName; ?>">anything</div><?php echo $close; ?>

answered May 20 '15 at 23:45

SaidbakR

Nice to see these techniques documented, thanks! This will stop many naive scrapes, but unfortunately it won't stop smart scrapes -- those that work "inside out", by finding text and then inferring intent from surrounding tags. For those, you have to deploy an operational fix at a lower network level. You still might not be able to detect and stop distributed scrapes, but in the end what commentators said in the original question holds true: public content can be copied, by some means, always. – bishop May 21 '15 at 15:12

score 3 · Answer 2 · answered May 21 '15 at 00:20

Sorry to say, but your effort will be wasted. Even if the class name randomly changes, your DOM can still be attacked positionally, like: div + div > span > a.

But even if you rotated your positions (e.g. by adding spurious div and span elements), any scraper worth its salt isn't actually going to care: it's going to find the text on your page, then infer the intent from the nearest markup. That's how Google works, BTW.
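To illustrate the "find the text first" idea, here is a rough sketch using PHP's DOM extension. The markup, the class name c9f3a1 and the wrapper divs are made-up examples of what a randomized page might emit; the query matches on the element's text and never looks at the class or the nesting depth.

```php
<?php
// Made-up example of a page using a random class name and extra wrappers.
$html = '<div><div><div class="c9f3a1">anything</div></div></div>';

// Collect parser notices internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
// Select any element whose own text is "anything", whatever its class
// name is and however deeply it is nested.
$nodes = $xpath->query('//*[normalize-space(text()) = "anything"]');

$found = $nodes->item(0)->textContent;
echo $found, PHP_EOL; // prints "anything"
```

A real scraper would match on known phrases or patterns rather than one fixed string, but the principle is the same.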

You have one realistic approach to this problem. First, attach an IDS monitor to your web server. When the IDS detects a scan pattern, throttle or shut down the IP. Or, and this is my favorite, throw the scanner into a honey pot with faked content. I.e., if your actual text reads "Fred's widgets are the best in the world", serve an alternate page that reads "Bob's gonads fell short of maritime bliss."

I deploy that latter tactic on a couple of my customers' sites to hilarious results on Chinese copy cats.
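A real IDS sits outside the application, so the following is only an application-level approximation of the idea, under my own assumptions: the function names, the thresholds (30 requests per 60 seconds) and the file names real.html and decoy.html are all hypothetical. It classifies a client from the timestamps of its recent requests and picks which page to serve.

```php
<?php
// Hypothetical stand-in for an IDS rule: classify a client from the
// Unix timestamps of its recent requests.
function classifyClient(array $requestTimes, int $now, int $window = 60, int $maxHits = 30): string
{
    // Keep only the requests that fall inside the sliding window.
    $recent = array_filter(
        $requestTimes,
        fn (int $t): bool => $t > $now - $window && $t <= $now
    );

    // More hits than allowed inside the window looks like a scan.
    return count($recent) > $maxHits ? 'decoy' : 'normal';
}

function pageFor(string $verdict): string
{
    // Suspected scanners get the fake page instead of the real content.
    return $verdict === 'decoy' ? 'decoy.html' : 'real.html';
}
```

How the timestamps are stored per IP (database, cache, log parsing) is left open here, and any real threshold would need tuning against your normal traffic.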

Thanks for BTW and IDS, but I don't understand your favorite, what should I do exactly ?? — , May 21 '15 at 00:51
You use software to detect when someone is scanning your pages. When detected, serve the scanner false pages. They'll think they're copying you, but in reality, they're copying bogus content. — bishop, May 21 '15 at 01:03
I'm working on a Solr application that has a .NET app working as a page crawler; it gets contents using the XPath of elements on each site it crawls. The approach in this question should prevent it from working correctly. — SaidbakR, May 21 '15 at 14:41

Shafizadeh · Accepted Answer · 2015-06-01T23:33:15.417

Fetching the class name from the database is not optimal when it can be done locally. You should define an array of all class names and then pick one of them with array_rand, something like this:

// php code
   <?php
     // Pool of candidate class names; one is picked at random per request.
     $classes = array('class1','class2','class3','class4');
     $class_name = $classes[array_rand($classes)];
   ?>


// html code
     <div class="<?php echo $class_name; ?>">anything</div>


// css code
   <style>
     .<?php echo $class_name; ?> {
      /* your CSS rules */
     }
   </style>

Note: PHP code is normally not executed in a .css file, so you should write all the CSS that you want to be dynamic in your .php file and use <style> stuff </style>.

Meanwhile, as @sємsєм said, you can create dynamic HTML tags.

Something like this (full code):

// php code
   <?php
     // dynamic class
     $classes = array('class1','class2','class3','class4');
     $class_name = $classes[array_rand($classes)];

     // dynamic tags: each closing entry mirrors its opening entry in reverse order
     $tags_start = array('','<div>','<div><div>','<div><p>','<span><div>');
     $tags_end = array('','</div>','</div></div>','</p></div>','</div></span>');
     $numb = array_rand($tags_start);
   ?>


// html code
     <?php echo $tags_start[$numb]; ?>
     <div class="<?php echo $class_name; ?>">anything</div>
     <?php echo $tags_end[$numb]; ?>


// css code
   <style>
     .<?php echo $class_name; ?> {
      /* your CSS rules */
     }
   </style>
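Keeping two hand-written arrays in sync is easy to get wrong, so one possible variation (my own sketch, with a hypothetical helper name wrapperTags) is to derive the closing string from the list of tag names. The tag names in $choices are my own picks; I use only container elements because a <p> or <span> is not allowed to contain a <div> in valid HTML.

```php
<?php
// Hypothetical helper: build matching opening and closing wrappers from a
// list of tag names, so the closing order cannot get out of sync.
function wrapperTags(array $tagNames): array
{
    $open = '';
    $close = '';
    foreach ($tagNames as $name) {
        $open .= "<$name>";
        // Prepend, so the last tag opened is the first one closed.
        $close = "</$name>" . $close;
    }
    return [$open, $close];
}

// Candidate wrapper structures; one is picked at random per request.
$choices = [[], ['div'], ['div', 'div'], ['div', 'section'], ['section', 'div']];
[$start, $end] = wrapperTags($choices[array_rand($choices)]);

echo $start, '<div class="example">anything</div>', $end, PHP_EOL;
```

For instance, wrapperTags(['div', 'section']) gives '<div><section>' and '</section></div>'.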

To make it even harder, you can also put your content (here 'anything') inside its own dynamically named tag, in addition to the outer dynamic tags. For example:

<span1>anything</span1> // <span1> changes to <span2>, <span3>, <span4>, ...

In this case, the tag directly around the data is also dynamic, which makes things harder for crawlers.

Finally, I must say that you can't prevent crawlers completely; you can only make their job more difficult. If you really want to protect your data, you can do things like these:

Increase restrictions for users (e.g. only registered users can see important information).
Monitor the IPs that use your website (and block any that look suspicious).
Use relevant software (e.g. to limit the number of searches per IP per day).

actually the closing tags have to be reversed: `'</p></div>'` and `'</div></span>'` — Guglie, Mar 15 '20 at 14:17
