What happens when a denied page (robots) is still in sitemap.xml?

Question

I want to prevent a page from being indexed, along with its assets (images).

So if I tell crawlers to skip that page, but that page is still registered in sitemap.xml, will any information on that page be indexed?

It depends on how you actually “tell crawlers to skip that page”, whether you specified how those external assets are to be indexed separately, whether or not a specific crawler wants to respect your instructions, etc. – CBroe, Jun 29 '17 at 09:11
This question appears to be off-topic because it is not within the bounds of discussion as described in the help center. — , Jun 29 '17 at 13:47

score 0 · Answer 1 · answered Jun 29 '17 at 13:28


robots.txt disallows crawling, not indexing.

If you disallow crawling of a URL in your robots.txt, and you list this URL in your sitemap, it is still disallowed to be crawled. Occurrence in a sitemap doesn’t change this.

This URL might still be indexed, though (whether it’s in the sitemap or not).
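To see which sitemap entries are affected, you can compare the two files yourself. The following is only a minimal sketch using Python's standard library; the robots.txt rules, the sitemap contents, the `example.com` URLs and the helper name `blocked_sitemap_urls` are all made up for illustration. It lists sitemap URLs that a compliant crawler would not be allowed to fetch; it says nothing about whether those URLs end up indexed.

```python
from urllib.robotparser import RobotFileParser
import xml.etree.ElementTree as ET

# Hypothetical robots.txt and sitemap contents, inlined so the sketch is self-contained.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/private/page.html</loc></url>
</urlset>
"""

# Namespace used by the sitemap protocol's <urlset>, <url> and <loc> elements.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def blocked_sitemap_urls(robots_txt, sitemap_xml, user_agent="*"):
    """Return the sitemap URLs that robots.txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]

    # can_fetch() answers "may this agent crawl the URL?", not "will it be indexed?".
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    for url in blocked_sitemap_urls(ROBOTS_TXT, SITEMAP_XML):
        print("listed in sitemap but disallowed:", url)
```

Any URL this reports is in exactly the situation described above: the sitemap entry does not override the Disallow rule, and the bare URL may still show up in search results.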

answered Jun 29 '17 at 13:28

unor


You need to fetch a page in order to index it. If it is disallowed by the robots directives, then it won't be indexed. Not all crawlers follow robots.txt, though. – Julien Nioche Jun 29 '17 at 21:11

@JulienNioche: Nope, you can index a URL (not a page) without fetching the page. Many search engines (including Google Search) do this. You will then typically see a notice like "The site’s robots.txt doesn’t allow us to crawl this page, that is why we can’t show you a description". They might even show a title, taken from hyperlink anchors that linked to it. – unor Jun 29 '17 at 22:12
You're right, I hadn't considered that aspect. Thanks! – Julien Nioche Jun 30 '17 at 07:47

score 0 · Answer 2 · answered Jun 30 '17 at 07:55

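Just to add to the previous answer, you can use the Noindex directive in your robots.txt file. It is not part of the standard AFAIK and was never officially documented, so do not rely on it: Google stopped honoring it on September 1, 2019, and there were diverging opinions about it even before that (see blog). Alternatively, you could use the robots meta tag in your webpages. Note that a crawler has to fetch the page to see that tag, so the page must not be disallowed in robots.txt at the same time.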

As usual, there is no guarantee that all crawlers will respect the robots directives, although the main ones will.
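Since the question also asks about images, which cannot carry a meta tag, here is a rough sketch of the header-based equivalent. The `X-Robots-Tag` HTTP response header is not mentioned above; the major search engines document it as the counterpart of the robots meta tag for non-HTML resources. The path prefix, the helper name `robots_headers` and the idea of calling it from your own server code are assumptions for illustration only.

```python
# Hypothetical path prefixes whose responses should carry a noindex signal.
NOINDEX_PREFIXES = ("/private/",)


def robots_headers(path):
    """Return extra response headers for a request path.

    Pages and assets (e.g. images) under a NOINDEX_PREFIXES entry get
    "X-Robots-Tag: noindex"; everything else gets no extra header.
    """
    if path.startswith(NOINDEX_PREFIXES):
        return {"X-Robots-Tag": "noindex"}
    return {}


if __name__ == "__main__":
    for path in ("/private/photo.jpg", "/index.html"):
        print(path, robots_headers(path))
```

As with the meta tag, a crawler only sees this header if it is allowed to request the URL, so the same paths should not also be disallowed in robots.txt.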
