4

Each blog post on my site -- http://www.correlated.org -- is archived at its own permalinked URL.

On each of these archived pages, I'd like to display not only the archived post but also the 10 posts that were published before it, so that people can get a better sense of what sort of content the blog offers.

My concern is that Google and other search engines will consider those other posts to be duplicate content, since each post will appear on multiple pages.

On another blog of mine -- http://coding.pressbin.com -- I tried to work around that by loading the earlier posts via an AJAX call, but I'm wondering if there's a simpler way.
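
Roughly, that AJAX workaround looks something like this (the /earlier-posts endpoint and its parameter are just illustrative):

    <div id="earlier-posts"></div>
    <script>
    // Fetch the previous posts after the page loads, so crawlers that
    // don't execute JavaScript never see them as part of this page.
    var xhr = new XMLHttpRequest();
    xhr.open('GET', '/earlier-posts?before=153', true);
    xhr.onload = function () {
        document.getElementById('earlier-posts').innerHTML = xhr.responseText;
    };
    xhr.send();
    </script>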

Is there any way to signal to a search engine that a particular section of a page should not be indexed?

If not, is there an easier way than an AJAX call to do what I'm trying to do?

jawns317
  • 1,726
  • 2
  • 17
  • 26
  • Have the same problem on a site where we expand articles on the start and archive pages. We don't use hashbangs but history.pushState when we have expanded the content – snobojohan Aug 31 '11 at 06:49
  • Possible duplicate: http://stackoverflow.com/questions/3207211/is-there-a-way-to-make-robots-ignore-certain-text – Ben Regenspan Aug 31 '11 at 06:59
  • Not exactly what you're looking for, but it might be clearer for users AND search engines if instead of posting the full other articles, you just post their titles and a short excerpt and link to them. – Ben Regenspan Aug 31 '11 at 07:00

3 Answers

5

Caveat: this hasn't been tested in the wild, but should work based on my reading of the Google Webmaster Central blog and the schema.org docs. Anyway...


This seems like a good use case for structuring your content using microdata. This involves marking up your content as a Rich Snippet of the type Article, like so:

   <div itemscope itemtype="http://schema.org/Article" class="item first">
      <h3 itemprop="name">August 13's correlation</h3>        
      <p itemprop="description" class="stat">In general, 27 percent of people have never had any wisdom teeth extracted. But among those who describe themselves as pessimists, 38 percent haven't had wisdom teeth extracted.</p>
      <p class="info">Based on a survey of 222 people who haven't had wisdom teeth extracted and 576 people in general.</p>
      <p class="social"><a itemprop="url" href="http://www.correlated.org/153">Link to this statistic</a></p>  
   </div>

Note the use of itemscope, itemtype and itemprop to define each article on the page.

Now, according to schema.org, which is supported by Google, Yahoo and Bing, the search engines should respect the canonical URL described by the itemprop="url" above:

Canonical references

Typically, links are specified using the <a> element. For example, the following HTML links to the Wikipedia page for the book Catcher in the Rye.

    <div itemscope itemtype="http://schema.org/Book">
      <span itemprop="name">The Catcher in the Rye</span>—
      by <span itemprop="author">J.D. Salinger</span>
      Here is the book's <a itemprop="url"
      href="http://en.wikipedia.org/wiki/The_Catcher_in_the_Rye">Wikipedia page</a>.
    </div>

So when marked up in this way, Google should be able to correctly attribute each piece of content to its canonical URL and weight it in the SERPs accordingly.

Once you've finished marking up your content, you can test it using the Rich Snippets testing tool, which should give you a good indication of what Google thinks about your pages before you roll it into production.


P.S. The most important thing you can do to avoid a duplicate content penalty is to fix the titles on your permalink pages. Currently they all read 'Correlated - Discover surprising correlations', which will cause your ranking to take a massive hit.
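
For example, something along these lines (the second title is only an illustration based on the post above):

    <!-- Current: every permalink page shares the same title -->
    <title>Correlated - Discover surprising correlations</title>

    <!-- Better: a unique, descriptive title for each permalink page -->
    <title>Pessimists and wisdom teeth - Correlated</title>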

Ciaran
  • 1,904
  • 16
  • 27
  • Interesting. Will try with canonical itemprop – snobojohan Sep 01 '11 at 16:12
  • postscript -> Will they "take a hit" because of the duplicate of Correlated/correlations? – Kieran Sep 07 '11 at 06:27
  • no @Kieran - they'll take a hit because the title is the same on every permalink page. The title should be unique to each page. See the duplicated titles here: http://www.correlated.org/153, http://www.correlated.org/153 – Ciaran Sep 07 '11 at 06:30
0

I'm afraid it is not possible to tell a search engine that a specific area of your web page should not be indexed (for example, a particular div in your HTML source). A workaround would be to put the content you do not want search engines to index into an iframe, and then use a robots.txt file with an appropriate Disallow rule to deny access to the specific file loaded by the iframe.
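
As a rough sketch (the file name earlier-posts.html is just an example):

    <!-- On each permalink page, pull the earlier posts in via an iframe -->
    <iframe src="/earlier-posts.html" width="100%" height="600"></iframe>

    # robots.txt -- keep crawlers away from the iframe's source document
    User-agent: *
    Disallow: /earlier-posts.html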

GibboK
  • 71,848
  • 143
  • 435
  • 658
0

You can't tell Google to ignore portions of a web page but you can serve up that content in such a way that the search engines can't find it. You can either place that content in an <iframe> or serve it up via JavaScript.

I don't like those two approaches because they're hackish. Your best bet is to completely block those pages from the search engines, since all of the content is duplicated anyway. You can accomplish that in a few ways (rough examples of each follow the list):

  1. Block your archives using robots.txt. If your archives are in their own directory then you can block the entire directory easily. You can also block individual files and use wildcards to match patterns.

  2. Use the <META NAME="ROBOTS" CONTENT="noindex"> tag to block each page from being indexed.

  3. Use the X-Robots-Tag: noindex HTTP header to block each page from being indexed by the search engines. This is identical in effect to using the <META NAME="ROBOTS" CONTENT="noindex"> tag, although this one can be easier to implement since you can use it in a .htaccess file and apply it to an entire directory.
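
For instance (the /archive/ path and directory layout are assumptions about your setup):

    # 1. robots.txt -- block the whole archive directory
    User-agent: *
    Disallow: /archive/

    <!-- 2. Meta robots tag, placed in the <head> of each archived page -->
    <meta name="robots" content="noindex">

    # 3. .htaccess in the archive directory (requires mod_headers) --
    # sends the X-Robots-Tag header for every file it covers
    Header set X-Robots-Tag "noindex"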

John Conde
  • 217,595
  • 99
  • 455
  • 496