
My SPA uses the Backbone.js router with pushState and hashed URLs as a fallback. I intend to use Google's suggestion for making an AJAX web app crawlable. That is, I want to render my site into static .html snapshots generated by PhantomJS and deliver them to Google via the URL:

mysite.com/?_escaped_fragment_=key=value.

Keep in mind that the site does not serve static pages for end users (it only works in a JavaScript-enabled browser). If you navigate to mysite.com/some/url, the .htaccess file is set up to always serve mysite.com/index.php, and the Backbone router reads the URL to display the JavaScript-generated content for that URL.
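For reference, a minimal sketch of such a catch-all rule, assuming Apache's mod_rewrite (the actual rules may differ):

    RewriteEngine On
    # Serve real files (scripts, styles, images) untouched.
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    # Route everything else to the single-page app entry point.
    RewriteRule ^ index.php [L]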

Furthermore, so that Google will index my entire site, I plan to create a sitemap listing hashbang URLs. The URLs must be hashbanged so that Google knows to index the site via the _escaped_fragment_ URLs.
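For example, the sitemap entries would look something like this (URLs illustrative):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>http://mysite.com/#!/some/url</loc></url>
      <url><loc>http://mysite.com/#!/another/url</loc></url>
    </urlset>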

Soooo....

(1) Will this approach work?

and

(2) Since Backbone.js does not use hashbang URLs, how can I convert the hashbang URL to the pushState URL when the user arrives via Google?

reference: https://stackoverflow.com/a/6194427/1102215

– Gil Birman

2 Answers


I ended up stumbling through the implementation as outlined in my question. So...

(1) Yes, the approach seems to work rather well. The only downside is that even though the app works without hashbangs, my sitemap.xml is full of hashbang URLs. This is necessary to tip Google off that it should query the _escaped_fragment_ URL when crawling these pages. So when the site appears in Google search results there is a hashbang in the URL, but that's a small price to pay.

(2) This part was a lot easier than I had imagined. It only required one line of code before initializing the Backbone.js router...

window.location.hash = window.location.hash.replace(/#!/, '#');

var AppRouter = Backbone.Router.extend({...

After the hashbang is replaced with a plain hash, the Backbone router will automatically remove the hash for browsers that support pushState. Furthermore, those two URL state changes are not saved in the browser's history, so if the user clicks the back button there is no weirdness or unexpected redirects.
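For context, here is roughly how the full start-up sequence would look so that Backbone performs that rewrite (a minimal sketch; the catch-all route and handler name are hypothetical):

    // Strip the crawler-style hashbang before the router sees the URL.
    window.location.hash = window.location.hash.replace(/#!/, '#');

    var AppRouter = Backbone.Router.extend({
        routes: { '*path': 'show' },  // hypothetical catch-all route
        show: function (path) {
            // render the JavaScript-generated content for `path`
        }
    });

    new AppRouter();

    // With pushState enabled, Backbone rewrites mysite.com/#some/url
    // to mysite.com/some/url on capable browsers at start-up.
    Backbone.history.start({ pushState: true });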

UPDATE: A better approach

It turns out that there is a dead simple approach which completely does away with hashbangs. Via BromBone:

If your site is using hashbang (#!) URLs, then Google will crawl your site by replacing #! with ?_escaped_fragment_=. When you see ?_escaped_fragment_=, you'll know the request is from a crawler. If you're using HTML5 pushState, then you look at the "User-Agent" header to determine if the request is from a bot.

This is a modified version of BromBone's suggested .htaccess rewrite rules:

    RewriteEngine On
    # Don't rewrite image requests or anything that maps to a real file/directory.
    RewriteCond $1 !\.(gif|jpe?g|png)$ [NC]
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    # Only known crawler user agents get the pre-rendered snapshot.
    RewriteCond %{HTTP_USER_AGENT} .*Googlebot.* [OR]
    RewriteCond %{HTTP_USER_AGENT} .*Bingbot.* [OR]
    RewriteCond %{HTTP_USER_AGENT} .*Baiduspider.* [OR]
    RewriteCond %{HTTP_USER_AGENT} .*iaskspider.*
    RewriteRule ^(.*)$ snapshot.php/$1 [L]
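snapshot.php is not shown here; a minimal sketch of what it might look like, assuming the PhantomJS snapshots are saved as flat .html files under a snapshots/ directory (the directory and naming scheme are assumptions):

    <?php
    // Hypothetical snapshot.php: serve the pre-rendered snapshot for the
    // requested route, or fall back to the app shell if none exists yet.
    $path = isset($_SERVER['PATH_INFO']) ? trim($_SERVER['PATH_INFO'], '/') : '';
    $name = ($path === '') ? 'index' : str_replace('/', '_', $path);
    $file = __DIR__ . '/snapshots/' . $name . '.html';

    if (is_file($file)) {
        readfile($file);                 // static snapshot for the crawler
    } else {
        include __DIR__ . '/index.php';  // no snapshot yet: serve the SPA shell
    }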
– Gil Birman
  • A quick comment. You cannot say with certainty that the user agent will have the spider name in the UA string. They will often disguise themselves with a regular browser's UA string to see what happens. This approach would help. But for you to serve the 'core' site content you need to have the route sent to the server, and the route is not sent to the server when the # is being used. My worry is this would only serve the home view's content and not any deep-linked content. Does that make sense? – Chris Love Dec 21 '13 at 02:08
  • Chris, are you talking about the RewriteRule? The actual rewrite rule I'm using is `RewriteRule ^(.*)$ snapshot.php/$1 [L]` ... I've updated this answer to reflect that – Gil Birman Dec 21 '13 at 04:21
  • also, google will not seek out the URL with the #. All of the URLs in the sitemap look like pushState URLs. – Gil Birman Dec 21 '13 at 04:28

Let me summarize something I wrote about ten pages on in my upcoming book on SPAs. Google wants a classic version of your site. This is also an advantage, because obsolete browsers really can't do SPAs effectively anyway. Serve the spiders and old browsers a core site.

I get the term from the Guardian newspaper: http://vimeo.com/channels/smashingconf.

In the browser, check whether the browser "cuts the mustard"; here is my script for doing this:

<script>
    // "Cuts the mustard" feature test: modern browsers pass, old ones fail.
    if (!('querySelector' in document)
        || !('localStorage' in window)
        || !('addEventListener' in window)
        || !('matchMedia' in window)) {

        // Legacy browser: send it to the server-rendered core site.
        if (window.location.href.indexOf("#!") > 0) {
            window.location.href = window.location.href.replace("#!", "?_escaped_fragment_=");
        } else {
            if (window.location.href.indexOf("?_escaped_fragment_=") < 0) {
                window.location.href = window.location.href + "?_escaped_fragment_=";
            }
        }

    } else {

        // Modern browser: if it landed on a core-site URL, bounce it back to the SPA.
        if (window.location.href.indexOf("?_escaped_fragment_=") >= 0) {
            window.location.href = window.location.href.replace("?_escaped_fragment_=", "#!");
        }
    }
</script>

On the server you need some mechanism to check for the presence of the _escaped_fragment_ query string. If it is present, you need to serve the core site. The core site uses only simple CSS and little or no JavaScript. I have a SPAHelper library for ASP.NET MVC you can check out to see some things I implemented around this: https://github.com/docluv/spahelper.
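In PHP terms (the stack used in the question), that server-side check might look like this minimal sketch (core.php is a hypothetical server-rendered core site):

    <?php
    // Minimal sketch: serve the simple "core" site when the request
    // carries the _escaped_fragment_ query string, else the SPA shell.
    if (isset($_GET['_escaped_fragment_'])) {
        include __DIR__ . '/core.php';   // hypothetical server-rendered core site
    } else {
        include __DIR__ . '/index.php';  // normal single-page app
    }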

The real issue is that most server-side web frameworks like ASP.NET, PHP, etc. are not designed to support a single view system for both client and server, so you are sort of stuck maintaining two views. Again, I wrote about ten pages around this topic for my book, which should be ready sometime next week.

– Chris Love
  • Chris, thank you for the response. In my question I wrote that there is no non-JavaScript version of the site. IOW, old browsers are SOL. That's by design because this is a map-based app. Also, redirecting to an _escaped_fragment_ URL defeats the purpose because the hashbang URLs are in the sitemap.xml and therefore those are the URLs that Google will send the user to (when the site shows up in Google's search results). – Gil Birman Dec 19 '13 at 23:22
  • If you read the Google guidelines it requires the document be generated on the server, like a classic site. Hence why you use the escape fragment query string variable. You have to serve the core site to the spider to meet the criteria. the #! fragment does not get sent to the server, hence the ?. The site map should have the #! version, the spider knows to convert that url to the querystring version. – Chris Love Dec 19 '13 at 23:41
  • I'm not disputing what you just said. My point is that doing a JavaScript redirect from a #! URL to an escaped_fragment URL is the wrong way to do it. Google will automatically seek out the escaped_fragment URL when it sees the #!. Furthermore, the redirect will send your users exactly to where they shouldn't be sent to, ie: the static html page intended only for spiders. – Gil Birman Dec 20 '13 at 19:22
  • No, I don't think you are understanding what I was saying. I decided to take advantage of needing a core site for the search engine. Because I need that core site, I repurpose it to serve to out-of-date browsers. If the visitor is using an obsolete browser, give them an experience that works in that browser. Don't go out of your way to create a very complicated solution to make your modern experience work in an old browser. It is a lot of work and polyfills to make something work in environments that should not exist within a few years. – Chris Love Dec 20 '13 at 20:44
  • OK, I think I see what you're saying. The statement **if (window.location.href.indexOf("#!") > 0)** will only be evaluated for older browsers. That wasn't so clear from your answer. If that's the case then what you're doing makes perfect sense. – Gil Birman Dec 20 '13 at 22:27
  • Sweet! If I am writing a book about it I need to make sure the concept is understandable :) – Chris Love Dec 21 '13 at 02:04