
I've published a website and, due to a misunderstanding beyond my control, I had to block all of its pages from being indexed. Some of these pages had already been linked on social networks, so to avoid a bad user experience I decided to put the following into "robots.txt":

User-agent: *
Disallow: *

I've received a "critical problem" alert in Webmaster Tools and I'm a bit worried about it. In your experience, would it be enough to simply restore the original "robots.txt" once I'm allowed to? Could the current situation have lasting consequences for the website (penalties or similar) if it goes on for a long time, and if so, how could I fix them? I'm sorry if the question sounds a bit generic, but I haven't been able to find specific answers. Thanks in advance.

Giorgio
  • Is the given code your current robots.txt? What do you intend it to mean (i.e., what should bots be allowed or not allowed to do)? – unor Apr 13 '14 at 15:02
  • @unor Yes, this is my current "robots.txt". My goal (for the moment) is to block all bots, but still let users who click external links see the pages instead of a blank page. In a few words, I would like the website to be accessed only by humans and not by bots. My fear is that this situation could compromise the website's reputation in search engines even after the Disallow is removed, if it lasts for a long time. I hope it's a bit clearer now. – Giorgio Apr 13 '14 at 18:12

2 Answers


The "critical problem" occurs because Google cannot index pages on your site with your robots.txt configuration. If you're still developing the site, it is standard procedure to have this robots.txt configuration. Webmaster tools treats your site as if it was in production however it sounds like you are still developing so this is something of a false-positive error message in this case.

Having this robots.txt configuration has no long-term negative effect on search engine rankings; however, the longer search engines are able to crawl your site, the better its ranking tends to be. For Google, something like three months of stable crawling earns a site a kind of trusted status. So it depends on the domain, whether it has been indexed by Google before, and for how long, but there are still no long-term consequences: at worst you will have to wait another three months or so to "earn Google's trust" again.

Most social networks read the robots.txt file at the moment a user shares a link. Search engines, on the other hand, vary in how often they crawl and can take anything from a few hours to a couple of weeks to detect changes in your robots.txt file and update their index.
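
If you want to sanity-check what a strict, prefix-matching parser currently sees in your live robots.txt, a small Python sketch like the one below can help. It uses the standard library's urllib.robotparser; the example.com URL and the Googlebot user agent are just placeholders, and real crawlers may behave a little differently.

from urllib.robotparser import RobotFileParser

# Fetch and parse the live robots.txt (replace example.com with your own domain).
parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

# Ask whether a given user agent may crawl a given URL under the current rules.
for path in ["/", "/some-page.html"]:
    allowed = parser.can_fetch("Googlebot", "https://example.com" + path)
    print(path, "->", "crawlable" if allowed else "blocked")

Keep in mind this only tells you what the file says right now; as noted above, search engines may keep acting on a cached copy for a while.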

Hope this helps. If you can provide more details about your circumstances I may be able to help further, but this should at least answer your question.

tpbapp
  • Hi @tpbapp, thanks for your answer. The domain has never been indexed before. When I finished development I bought a domain, published all the pages and submitted the URL to Google through its form. I also created company profiles on social networks and published some posts containing links to the website. Unfortunately, after a few days I was asked to block Google indexing until some problems were solved. With the Disallow in robots.txt, users can still see the pages when they click the links, but indexing is disabled. I hope to remove this block in a few weeks, so I hope it will not leave lasting consequences. – Giorgio Apr 12 '14 at 19:15
  • No problem. If it's a fresh domain there will be no consequences, I can assure you. This type of situation is exactly what the robots.txt protocol was designed for. Just be aware that, as I said, it can take a while for search engines to pick up changes to the robots.txt file. Best of luck to you. – tpbapp Apr 12 '14 at 21:51

My goal (for the moment) is to block all bots

Your current robots.txt does not block all bots.

In the original robots.txt specification, Disallow: * means: disallow crawling of all URLs whose path starts with the literal character *, for example:

  • http://example.com/*
  • http://example.com/****
  • http://example.com/*p
  • http://example.com/*.html

Some parsers don't follow the original specification and interpret * as a wildcard character. For them (and only for them) it would probably mean blocking all URLs (where * matches any sequence of characters).

In a few words, I would like the website to be accessed only by humans and not by bots.

Then you should use:

User-agent: *
Disallow: /
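
As a rough illustration (not part of the specification itself), the sketch below shows how Python's standard-library parser, urllib.robotparser, which does plain prefix matching, treats the two variants; the SomeBot user agent and the example.com URLs are placeholders:

from urllib.robotparser import RobotFileParser

def check(rules, label):
    # Parse an in-memory robots.txt and test a couple of URLs against it.
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    for url in ["http://example.com/", "http://example.com/page.html"]:
        allowed = parser.can_fetch("SomeBot", url)
        print(f"{label}: {url} -> {'crawlable' if allowed else 'blocked'}")

# "Disallow: /" is a plain prefix rule: every URL path starts with "/",
# so a prefix-matching parser blocks everything.
check("User-agent: *\nDisallow: /", "Disallow: /")

# "Disallow: *" is a literal prefix under the original specification;
# whether it blocks anything depends on how the parser treats "*".
check("User-agent: *\nDisallow: *", "Disallow: *")

Only the Disallow: / form reliably means "block everything", whether or not the parser supports wildcards.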
unor
  • This is a good point, and I agree `Disallow: /` is a much better choice, but it's worth pointing out that all of the major search engines understand wildcards. Using `Disallow: *` may confuse some older robots, but it will still work on Google, Bing, and Ask. – plasticinsect Apr 14 '14 at 21:42