16

I have a PHP page that renders a book of, let's say, 100 pages. Each page has a specific URL (e.g. /my-book/page-one, /my-book/page-two, etc.).

When flipping the pages, I change the URL using the History API, via url.js.

Since all the book content is rendered server-side, the problem is that the content is indexed by search engines (I'm especially referring to Google), but the URLs are wrong (e.g. it finds a snippet on page-two, but the URL is page-one).

How can I stop search engines (at least Google) from indexing all the content on the page, and make them index only the visible book page?

Would it work if I rendered the content in a different way, for example <div data-page-number="1" data-content="Lorem ipsum..."></div>, and then converted that into the needed format on the JavaScript side? That would make the page slower, and in fact I'm not sure Google won't index the content changed by JavaScript.

The code looks like this:

<div data-page="1">Page 1</div>
<div data-page="2">Page 2</div>
<div data-page="3" class="current-page">Page 3</div>
<div data-page="4">Page 4</div>
<div data-page="5">Page 5</div>

The only visible div is the .current-page one. The same content is served on multiple URLs because that's needed so the user can flip between pages.

For example, /book/page/3 will render this piece of HTML, while /book/page/4 renders the same thing; the only difference is that the current-page class is added to the 4th element.
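For reference, the flipping logic described above could be sketched like this (pageUrl and showPage are hypothetical helpers for illustration, not the actual url.js code):

```javascript
// Build the path for a given book page (URL pattern taken from the question).
function pageUrl(page) {
  return "/book/page/" + page;
}

// Toggle which div is visible and update the address bar with the
// History API, without reloading the page.
function showPage(page) {
  document.querySelectorAll("[data-page]").forEach(function (el) {
    // Only the requested page keeps the current-page class.
    el.classList.toggle("current-page", Number(el.dataset.page) === page);
  });
  history.pushState({ page: page }, "", pageUrl(page));
}
```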

Google did index the different URLs, but it did so incorrectly: for example, the snippet "Page 5" links to /book/page/2, which renders "Page 2" (not "Page 5") to the user.

How can I tell Google (and other search engines) that I'm only interested in indexing the content in .current-page?

Ionică Bizău
  • You can use `robots.txt` to tell Google. AFAIK Google respects it. Most probably it would be better to build a `sitemap.xml` and tell Google what to index and what not. You can also use Google's Webmaster Tools to push the changes and see how Google is crawling your site. – Praveen Kumar Purushothaman May 06 '16 at 09:48
  • The question is *how*? I'm not sure if any of these would work. In short, I serve the same HTML on different urls, but I show only a specific part of it depending on the url. – Ionică Bizău May 06 '16 at 10:00
  • Can you give an example of a URL that is wrongly indexed? Or do you do the change onClick on the element? – OBender May 08 '16 at 09:29
  • @OBender Let's suppose I have `Hello World` on page `42` (under the url `/my-book/page/42`). It's very possible that Google indexes this content on another url (and obviously another page), for example, `/my-book/page/7`. That happens because I serve the same content on multiple urls. I have no idea how this can be fixed... – Ionică Bizău May 08 '16 at 10:50
  • Do you mean that : /my-book/page/42 and /my-book/page/7 Have the same Content ? – OBender May 08 '16 at 12:40
  • @OBender Exactly. But the visible area to the user is different. – Ionică Bizău May 08 '16 at 12:52

4 Answers

6

As I understand it, the issue is that you have the same content under many URLs, like:

www.my-awesome-domain.com/my-book/page/42

www.my-awesome-domain.com/my-book/page/7

and the visible content of the page is adjusted by JavaScript that runs when the user clicks elements on your site.

In this case you need to do two things:

  1. Mark your URLs as canonical pages in one of the ways described in this Google document: https://support.google.com/webmasters/answer/139066?hl=en
  2. Add a feature so that each page loads into the same state after a full page refresh; for example, you can use a hash parameter when navigating, as described in the linked article and technique overview.
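Step 2 might look roughly like this with the hash-based variant (the #page-N format is an assumption for illustration, and showPage is a hypothetical renderer):

```javascript
// Recover the page number from a hash URL like /my-book#page-7,
// so a full refresh restores the same visible page.
function pageFromHash(hash) {
  var m = /^#page-(\d+)$/.exec(hash || "");
  return m ? parseInt(m[1], 10) : 1; // default to the first page
}

// Hypothetical wiring in the browser:
// window.addEventListener("hashchange", function () {
//   showPage(pageFromHash(location.hash));
// });
```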

Googlebot executes JavaScript nowadays, as announced on their official blog: https://webmasters.googleblog.com/2015/10/deprecating-our-ajax-crawling-scheme.html

So if you achieve correct page behavior when hitting Refresh (F5) and specify the canonical page property, the pages will be crawled correctly, and following a link will lead to the linked page.

If you need more guidance on how to do it in url.js, please post another question (so it will be properly documented for others) and I will be glad to help.

OBender
  • Can you give me an example of how the code would look? I'm not sure how canonical URLs would help here. How do I make the link between the URL and the right part of the page that is visible? – Ionică Bizău May 08 '16 at 13:18
  • A canonical URL will eliminate the duplicate-content penalty across many pages; you need to make one page per book list, and the others will be canonical to this page. What code do you use to hide and show the per-book content? I will suggest how to modify it – OBender May 08 '16 at 13:37
  • Let's suppose I have hidden divs and one of them is visible, containing the page content. I'm not sure what you mean by *make 1 page per books list*. – Ionică Bizău May 08 '16 at 15:00
  • OK, so make them visible on page load. Regarding "1 page per books list": do all pages have the same content? Or do you have, for example, a category that has those many divs, and then one div is displayed per book? – OBender May 09 '16 at 07:21
  • I can't make them visible because it's not what I want. I want to display one page, depending on the url and then allow the user to navigate through the pages. – Ionică Bizău May 09 '16 at 19:06
  • So, as I told you, make only one visible per URL, but on load of the page via hash-tag URLs... and change the hash tag on user click too. Have you read this article that I linked for you: http://blog.mgm-tp.com/2011/10/must-know-url-hashtechniques-for-ajax-applications/ ? – OBender May 10 '16 at 08:25
  • More info here : https://github.com/browserstate/history.js/wiki/Intelligent-State-Handling – OBender May 10 '16 at 11:34
  • I still don't understand what you mean: the user does not click anything on the page. After opening the web page, the book page appears on the screen. The user has the possibility to go to the next/prev pages by clicking two buttons. When they do that, the pages are flipped and the URL is updated using HTML5 states. Summarizing, what do I have to do to fix the problem? – Ionică Bizău May 13 '16 at 04:05
  • What do you mean by "HTML5 states"? Make it use hash-tag navigation; this will solve the issue... as I wrote before. – OBender May 13 '16 at 07:30
  • That's not possible because there are a lot of URLs that people access, so the book needs to have these URLs. HTML5 history states allow us to change the pathname without reloading the page. – Ionică Bizău May 13 '16 at 07:43
  • What I say doesn't conflict with that... Each book will have, for example: http://www.mysuperbookstore.com/allbooks#bookTitle, where bookTitle will be your div identifier, and on click you will change the URL to: http://www.mysuperbookstore.com/allbooks#AnotherBookTitle. This way navigation & the back button will work, and SEO will work :-) – OBender May 13 '16 at 09:01
  • No, it won't, because we need non-hash urls (which are already heavily used by the users). – Ionică Bizău May 13 '16 at 09:40
  • Maybe I just don't understand your point, but I do not want to use hashes. I need pathnames. – Ionică Bizău May 13 '16 at 11:13
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/111863/discussion-between-obender-and-ionic-bizu). – OBender May 13 '16 at 13:13
5

The answer is really simple: you can't do it. There is no technical way to keep the same content under different URLs and ask search engines to index only part of it.

If you are OK with having only one page indexed, you can use, as suggested before, canonical URLs: you place a canonical URL pointing to the main page on every sub-page.
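For the URL scheme in the question, computing that canonical target could look like this (a sketch; the /page/N pattern is taken from the question, not from this answer):

```javascript
// Map every sub-page URL to the main page it should be canonical to,
// e.g. /my-book/page/42 -> /my-book. The result would then be emitted
// server-side as <link rel="canonical" href="..."> in the <head>.
function canonicalFor(path) {
  return path.replace(/\/page\/\d+$/, "");
}
```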

You may find a "hack" that uses special tags understood by the Google Search Appliance: googleon and googleoff.

https://www.google.com/support/enterprise/static/gsa/docs/admin/70/gsa_doc_set/admin_crawl/preparing.html
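Per that document, the markers are HTML comments wrapped around the content the appliance should skip; a minimal example:

```html
<p>Indexed normally.</p>
<!--googleoff: index-->
<p>Not added to the Search Appliance index.</p>
<!--googleon: index-->
```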

The only issue is that this will most likely not work with Googlebot (at least no one guarantees it will) or any other search engine.

Aleksander Wons
  • I may fall back to rendering the content on user interaction (from JS), so there should be a solution anyway. I'm interested in the best solution. – Ionică Bizău May 10 '16 at 18:43
2

I don't think you will be able to achieve what you are looking for.

I can't see how robots.txt would have any effect, and canonical tags don't work on divs.

Google has spoken about sites like these in the past and made some suggestions for indexing; here are a couple of links that may help:

https://www.seroundtable.com/seo-single-page-12964.html

https://www.seroundtable.com/google-on-crawling-javascript-sites-progressive-web-apps-21737.html

user29671
1

Save the content in a JSON file which you do not render into the HTML. From the server, serve only the correct page: the content that is visible to the user.

When the user clicks the buttons (prev/next page links, etc.), render the content you have in the JSON file using JavaScript, and change the URL like you're already doing.

That way you know you always serve the right content from the server, and the Google bot will index the pages correctly.
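A minimal sketch of this approach, assuming a JSON object keyed by page number (the names and data shape here are made up for illustration):

```javascript
// Pick the content for page n from a pre-fetched JSON object shaped
// like {"1": "Lorem...", "2": "..."}; returns null for missing pages.
function pageContent(pages, n) {
  return pages[String(n)] || null;
}

// Hypothetical prev/next click handler:
//   el.textContent = pageContent(pages, n);
//   history.pushState({ page: n }, "", "/book/page/" + n);
```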

GhitaB
  • This doesn't seem likely to work. The rise of SPAs has made search engines put a lot of effort into indexing JS-generated content. – Quentin Nov 05 '20 at 13:57