I have a WordPress site where I want to stop search engines from crawling an entire directory. I know that I can do this in the robots.txt file (in the root of the site) by adding a "Disallow" line for that directory. However...

On the same site I am using the "XML Sitemap" plugin to automatically build and submit a sitemap.xml whenever content changes on the site. Unfortunately, there is no way to automatically stop the plugin from listing pages within the directory that I do not want crawled. Each time I add a new page within that directory, I have to manually exclude that page from the sitemap (the plugin allows for this).

My question is: which takes precedence, the robots.txt file or the sitemap.xml file? In other words, if a page is listed in the sitemap.xml file, will it be crawled by the search engines if its parent directory is disallowed in robots.txt?

lamarant
  • This is off-topic here; it's not a programming question. Belongs on [webmasters](http://webmasters.stackexchange.com). Voting to move. – Ken White Apr 14 '11 at 17:03
  • These files serve different purposes. The robots.txt file is used to explicitly block or allow search engine spiders (those that obey it) from spidering certain areas of your site. The sitemap.xml file is used to give spiders an easy route to all pages on your site, and it can also contain weights for page importance, which search engines can then take into account. In summary, if you deny a page in robots.txt but it is listed in sitemap.xml, the robots.txt stops this page from being crawled and indexed by any search engine spiders that obey it (all the big ones do). – Darryl at NetHosted Apr 14 '11 at 16:48
  • > if a page is listed in the sitemap.xml file will it be crawled by the search engines if its parent directory is disallowed in robots.txt?
    The page will not get crawled, because Googlebot is blocked via the robots.txt. You will see an error in Webmaster Tools telling you that you submitted a URL that is blocked via the robots.txt. However, because crawling is optional(!!) for indexing, the pages might (and that's a big might) still show up in the Google SERPs. I explained the last aspect in more detail here: http://stackoverflow.com/questions/5537612/pages-not-indexed-by-google/5548511#5548511 – Franz Enzenhofer Apr 16 '11 at 19:32
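
To see which sitemap entries a robots.txt-compliant crawler would refuse to fetch, something like the following rough Python sketch might help. It uses only the standard library (`urllib.robotparser` and `xml.etree.ElementTree`). The `example.com` URLs, the `/private/` directory, and the `blocked_sitemap_urls` helper are all made up for illustration; they are not taken from the question or from the plugin. Note that it only checks whether fetching is allowed. As the comment above points out, a blocked URL might still end up indexed.

```python
from urllib.robotparser import RobotFileParser
import xml.etree.ElementTree as ET

# Hypothetical robots.txt that disallows one directory for all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical sitemap that still lists a page inside that directory.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/about/</loc></url>
  <url><loc>http://example.com/private/new-page/</loc></url>
</urlset>
"""

# Namespace used by the sitemaps.org sitemap format.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def blocked_sitemap_urls(robots_txt, sitemap_xml, user_agent="*"):
    """Return sitemap URLs that robots.txt forbids user_agent from fetching."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]

    # A compliant crawler skips every URL for which can_fetch() is False,
    # regardless of whether the URL appears in the sitemap.
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    for url in blocked_sitemap_urls(ROBOTS_TXT, SITEMAP_XML):
        print("Listed in sitemap but blocked by robots.txt:", url)
```

Running a check like this against the real robots.txt and the generated sitemap.xml would flag the pages that still need to be excluded from the sitemap by hand.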
