How to block search engines from indexing all URLs beginning with origin.domainname.com

Question

I have www.domainname.com and origin.domainname.com pointing to the same codebase. Is there a way to prevent all URLs under origin.domainname.com from getting indexed?

Is there a rule in robots.txt to do this? Both hostnames point to the same folder. I also tried redirecting origin.domainname.com to www.domainname.com in the .htaccess file, but it doesn't seem to work.

If anyone has had a similar problem and can help, I would be grateful.

Thanks

Lekensteyn · Accepted Answer · 2011-06-01T13:15:58.627

17

You can rewrite robots.txt to another file (let's call it 'robots_no.txt') containing:

User-Agent: *
Disallow: /

(source: http://www.robotstxt.org/robotstxt.html)

The .htaccess file would look like this:

RewriteEngine On
# For any host other than www.example.com, serve robots_no.txt as robots.txt
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^robots\.txt$ robots_no.txt [L]

Alternatively, use a customized robots.txt for each (sub)domain:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [OR]
RewriteCond %{HTTP_HOST} ^sub\.example\.com$ [OR]
RewriteCond %{HTTP_HOST} ^example\.com$ [OR]
RewriteCond %{HTTP_HOST} ^www\.example\.org$ [OR]
RewriteCond %{HTTP_HOST} ^example\.org$
# Rewrites the above (sub)domains <domain> to robots_<domain>.txt
# example.org -> robots_example.org.txt
RewriteRule ^robots\.txt$ robots_%{HTTP_HOST}.txt [L]
# in all other cases, use default 'robots.txt'
RewriteRule ^robots\.txt$ - [L]

Instead of asking search engines to skip every page on hosts other than www.example.com, you can also use <link rel="canonical">.

If http://example.com/page.html and http://example.org/~example/page.html both point to http://www.example.com/page.html, put the following tag in the <head>:

<link rel="canonical" href="http://www.example.com/page.html">

See also Google's article about rel="canonical".

edited Jun 01 '11 at 13:15

answered Oct 05 '10 at 06:53


Lekensteyn, that looks good. However, I have a small question: is it possible to serve different robots.txt files based on the URL? I haven't been able to find such a rule. If you could point me to tutorials on this, it would be helpful. Thanks. – Loveleen Kaur Oct 06 '10 at 03:49
What did you mean by 'based on url'? If you meant 'domain', look at the above example. Rewrite guide: http://httpd.apache.org/docs/current/rewrite/rewrite_intro.html. Another way to serve a different `robots.txt` for each domain is to use a server-side script, PHP for example. – Lekensteyn Oct 06 '10 at 07:21
@Lekensteyn It's OK for a domain, but how can I prevent a folder using .htaccess (without robots.txt)? – Nullpointer Jun 30 '16 at 09:15
@RaviG. Could you rephrase that and create a new question? It's unclear what you are asking for. Did you mean "How to prevent search engines from indexing folders such as /admin/?" – Lekensteyn Jun 30 '16 at 10:30
@Lekensteyn My mistake; it's a similar question: can I prevent indexing using .htaccess only (without a robots.txt file)? – Nullpointer Jun 30 '16 at 10:37
@RaviG. The robots.txt file is for well-behaved bots. If you would like to block specific bots, you could try to [match the user agents](https://stackoverflow.com/q/10735766/427545). Both the robots.txt and user agent blacklisting approach can however be bypassed. If you would like to prevent unauthorized access, consider using some form of authentication (login form, HTTP authentication, TLS client certificates, IP whitelists, etc.). – Lekensteyn Jun 30 '16 at 11:54
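
On the ".htaccess only" question raised in the comments above: one option that nobody in this thread mentions is the X-Robots-Tag response header, which the major search engines document as an equivalent of the robots meta tag. The following is only a rough sketch, not something taken from the answers here. It assumes mod_setenvif and mod_headers are enabled, that .htaccess overrides permit them (AllowOverride FileInfo), and it uses the hostname from the question. NOINDEX_HOST is a made-up variable name.

```apache
# Sketch only: requires mod_setenvif and mod_headers.
# NOINDEX_HOST is a hypothetical variable name chosen for this example.

# Flag requests whose Host header is the origin hostname.
SetEnvIfNoCase Host ^origin\.domainname\.com$ NOINDEX_HOST

# Ask compliant crawlers not to index or follow anything served for that host.
Header set X-Robots-Tag "noindex, nofollow" env=NOINDEX_HOST
```

Like robots.txt, this only affects well-behaved crawlers. A crawler has to be able to fetch a page to see the header, so it is probably better not to combine this with a robots.txt Disallow for the same host.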

score 1 · Answer 2 · answered Mar 14 '19 at 19:56

Using .htaccess only, you can redirect known crawlers based on their user agent:

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} AltaVista [OR]
RewriteCond %{HTTP_USER_AGENT} Googlebot [OR]
RewriteCond %{HTTP_USER_AGENT} msnbot [OR]
RewriteCond %{HTTP_USER_AGENT} Slurp
RewriteRule ^.*$ http://htmlremix.com [R=301,L]
