
Last night a customer called, frantic, because Google had cached versions of pages containing private employee information. That information is not available unless you log in.

They had done a Google search for their domain, e.g.:

site:example.com

and noticed that Google had crawled, and cached, some internal pages.

Looking at the cached versions of the pages myself:

This is Google's cache of https://example.com/(F(NSvQJ0SS3gYRJB4UUcDa1z7JWp7Qy7Kb76XGu8riAA1idys-nfR1mid8Qw7sZH0DYcL64GGiB6FK_TLBy3yr0KnARauyjjDL3Wdf1QcS-ivVwWrq-htW_qIeViQlz6CHtm0faD8qVOmAzdArbgngDfMMSg_N4u45UysZxTnL3d6mCX7pe2Ezj0F21g4w9VP57ZlXQ_6Rf-HhK8kMBxEdtlrEm2gBwBhOCcf_f71GdkI1))/ViewTransaction.aspx?transactionNumber=12345. It is a snapshot of the page as it appeared on 15 Sep 2013 00:07:22 GMT

I was confused by the long URL. Rather than:

https://example.com/ViewTransaction.aspx?transactionNumber=12345

there was a long string inserted:

https://example.com/[...snip...]/ViewTransaction.aspx?transactionNumber=12345

It took me a few minutes to remember: that might be a symptom of ASP.net's "cookie-less sessions". If your browser does not support cookies (Set-Cookie), the web-site will embed the session ID in the URL instead.
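(For reference, a cookieless session URL carries the ID in a parenthesized path segment. Below is a minimal sketch, with a made-up URL and session ID, of what that looks like and how to pull the value back out; the single-letter tag and the regular expression are illustrative, not an official API.)

// Sketch: a cookieless ASP.net session rides in a parenthesized path segment,
// e.g. https://example.com/(S(lit3py55t21z5v55vlm25s55))/SomePage.aspx
using System;
using System.Text.RegularExpressions;

class CookielessUrlPeek
{
    static void Main()
    {
        const string url =
            "https://example.com/(S(lit3py55t21z5v55vlm25s55))/ViewTransaction.aspx?transactionNumber=12345";

        Match m = Regex.Match(url, @"/\((?<tag>[A-Z])\((?<value>[^)]+)\)\)/");
        if (m.Success)
        {
            Console.WriteLine("tag   = " + m.Groups["tag"].Value);   // 'S' here; the cached URL above shows 'F'
            Console.WriteLine("value = " + m.Groups["value"].Value);
        }
    }
}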

Except our site doesn't use that.

And even if our site did have cookie-less sessions auto-detected, and Google managed to cajole the web-server into handing it a session in the URL, how did it take over another user's session?

Yes, Google (a non-malicious bot) hijacked a session

The site has been crawled by bots for years. And this past May 29 was no different.

Google usually starts its crawl by checking the robots.txt file (we don't have one). But nobody is allowed to read anything on the site (including robots.txt) without first being authenticated, so it fails:

Time      Uri                      Port  User Name         Status
========  =======================  ====  ================  ======
1:33:04   GET /robots.txt          80                      302    ;not authenticated, see /Account/Login.aspx
1:33:04   GET /Account/Login.aspx  80                      302    ;use https please
1:33:04   GET /Account/Login.aspx  443                     200    ;go ahead, try to login

All that time, Google was looking for a robots.txt file. It never got one. It then tried to crawl the root:

Time      Uri                      Port  User Name         Status
========  =======================  ====  ================  ======
1:33:04   GET /                    80                      302    ;not authenticated, see /Account/Login.aspx
1:33:04   GET /Account/Login.aspx  80                      302    ;use https please
1:33:04   GET /Account/Login.aspx  443                     200    ;go ahead, try to login

And another check of robots.txt on the secure site:

Time      Uri                      Port  User Name         Status
========  =======================  ====  ================  ======
1:33:04   GET /robots.txt          443                     302    ;not authenticated, see /Account/Login.aspx
1:33:04   GET /Account/Login.aspx  443                     200    ;go ahead, try to login

And then the stylesheet on the login page:

Time      Uri                      Port  User Name         Status
========  =======================  ====  ================  ======
1:33:04   GET /Styles/Site.css     443                     200    

And that's how every crawl from GoogleBot, msnbot, and BingBot works. Robots, login, secure, login. Never getting anywhere, because it cannot get past WebForms Authentication. And all is well with the world.

Until one day, out of nowhere

Until one day, GoogleBot shows up, with a Session cookie in hand!

Time      Uri                        Port  User Name            Status
========  =========================  ====  ===================  ======
1:49:21   GET /                      443   jatwood@example.com  200    ;they showed up logged in!
1:57:35   GET /ControlPanel.aspx     443   jatwood@example.com  200    ;now they're crawling that user's stuff!
1:57:35   GET /Default.aspx          443   jatwood@example.com  200    ;back to the homepage
2:07:21   GET /ViewTransaction.aspx  443   jatwood@example.com  200    ;and here comes the private information

The user, jatwood@example.com, had not been logged in for over a day. (I was hoping that IIS had given the same session identifier to two different visitors, separated by an application recycle.) And our site (web.config) is not configured to enable cookieless sessions. Nor is the server (machine.config).

So:

  • how did Google get ahold of a cookieless session URL?
  • how did Google get ahold of a valid cookieless session URL?
  • how did Google get ahold of a valid cookieless session URL that belonged to another user?

As recently as October 1 (4 days ago), the GoogleBot was still showing up, cookie in hand, logging in as this user, crawling, caching, and publishing some of their private details.

How is Google (a non-malicious web-crawler) bypassing WebForms authentication?

IIS7, Windows Server 2008 R2, single server.

Theories

The server is not configured to give out cookieless sessions. But ignoring that fact, how can Google bypass authentication?

  • GoogleBot is visiting the web-site and attempting random usernames and passwords (not likely; the logs show no attempts to log in)
  • GoogleBot decided to insert a random cookieless session into the URL string, and it happened to match the session of an existing user (not likely)
  • The user managed to figure out how to make an IIS web-site return a cookieless URL (not likely), then pasted that URL onto another web-site (not likely), where Google found the cookieless URL and crawled it
  • The user is running through a mobile proxy (which they're not). The proxy server doesn't support cookies, so IIS creates a cookieless session. That caching server (e.g. Opera Mobile's) was breached (not likely) and all cached links were posted on a hacker forum. GoogleBot crawled the hacker forum and started following all links, including our jatwood@example.com cookieless session URL.
  • The user has a virus which manages to cajole IIS web-servers into handing back a cookieless URL. That virus then reports back to headquarters. The URLs are posted onto a publicly accessible resource that GoogleBot crawls. GoogleBot then shows up at our server with the cookieless URL.

None of these are really plausible.

How can Google (a non-malicious web-crawler) bypass WebForms authentication and hijack a user's existing session?

What are you asking?

I don't even know how an ASP.net web-site that is not configured to give out cookieless sessions could give out a cookieless session. Is it possible to back-convert a cookie-based session ID into a cookieless session ID? I could quote the relevant <sessionState> section of web.config and machine.config, and show there is no presence of

<sessionState cookieless="true">

How does the web-server decide that the browser doesn't support cookies? I tried blocking cookies in Chrome, and I was never given a cookie-less session identifier. Can I simulate a browser that doesn't support cookies, in order to verify that my server is not giving out cookieless sessions?

Does the server decide cookieless sessions by User-Agent string? If so, I could set Internet Explorer with a spoofed UA.
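A minimal probe along those lines might look like the sketch below. The user-agent string is only an assumed stand-in for whatever the browser-capability files classify as a downlevel, cookie-less browser; with redirects disabled, a Location header containing a parenthesized segment would suggest the server is issuing cookieless URLs.

// Sketch: request the site while claiming to be a browser that (presumably)
// does not support cookies, and inspect the redirect by hand.
using System;
using System.Net;

class CookielessProbe
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create("https://example.com/");
        request.UserAgent = "Mozilla/2.0 (compatible; Generic Downlevel)"; // assumed downlevel UA
        request.AllowAutoRedirect = false;                                 // look at the 302 ourselves

        using (var response = (HttpWebResponse)request.GetResponse())
        {
            Console.WriteLine((int)response.StatusCode);
            // A Location such as /(S(...))/Account/Login.aspx or /(F(...))/... would
            // mean a session ID or ticket is being handed out in the URL.
            Console.WriteLine(response.Headers["Location"]);
        }
    }
}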

Does session identity in ASP.net depend solely on the cookie? Can anyone, from any IP, with the cookie-url, access that session? Does ASP.net not, by default, also take the IP address into account?

If ASP.net does tie the IP address to the session, wouldn't that mean the session couldn't have originated from the employee at their home computer? Because then, when the GoogleBot crawler tried to use it from a Google IP, it would have failed?

Have there been any instances anywhere (besides the one I linked) of ASP.net giving out cookieless sessions when it's not configured to? Is there a Microsoft Connect issue on this?

Is WebForms authentication known to have issues, and should it not be used for security?


I removed the name of the bot (Google) that bypassed the authorization check, as people were confusing the crawler's name for something else. I use the crawler's name as a reminder that it was a non-malicious web-crawler that managed to crawl its way into another user's WebForms session, in contrast to a malicious crawler that was trying to break into another user's session.

Ian Boyd
  • You've got a problem. Whether or not it is Google doesn't matter. Your site obviously is not secure. Instead of posting complaints and (unproven) accusations leveled at Google, why not tell us a little bit about your site and maybe we can help you figure out what you did wrong? – John Wu Oct 04 '13 at 23:32
  • By the way, what is "jatwood@example.com" in your list? Please don't tell me that's the session ID!!! – John Wu Oct 04 '13 at 23:34
  • It seems that when you visit the page with Chrome (or maybe other browsers with Google stuff added), the URL you visit is passed to Google for indexing. We had the same with our corporate server residing at a confidential address and port (and of course no external links to that server). Nevertheless, your question is off-topic on SO. – Eugene Mayevski 'Callback Oct 05 '13 at 07:54
  • @JohnWu I wasn't blaming GoogleBot for breaching my site; I was asking how it is that even a dumb *robot* can get into the site. And if you've ever looked at [W3C logs in IIS you'll notice the `cs-username` column](http://www.microsoft.com/technet/prodtechnol/WindowsServer2003/Library/IIS/676400bc-8969-4aa7-851a-9319490a9bbb.mspx?mfr=true). Rather than quote the IIS log, revealing the user's e-mail address/domain login, I changed their e-mail address/login name to a different, made-up, e-mail address/login name. – Ian Boyd Oct 05 '13 at 14:15
  • @EugeneMayevski'EldoSCorp Correctly doing WebForms authentication in ASP.net is off-topic for a programming site? – Ian Boyd Oct 05 '13 at 14:16
  • @EugeneMayevski'EldoSCorp I wrote the question yesterday and re-read it today. Which parts of the question are you misinterpreting? Perhaps I can change them until you're satisfied the tone of the question is good enough to be answered. – Ian Boyd Oct 05 '13 at 16:59
  • This is interesting and very scary. Many times I have seen results in a Google search lead to pages which require authentication. On visiting the site/link you will see a message asking you to log in. But still Google shows results as if it is logged in. You can see the captured content in the cache/history, but how did it get the stuff in the first place? That can happen only if the Google bot somehow had the session/cookies, or had access to a copy of the authenticated content. – user568109 Oct 11 '13 at 10:41
  • @user568109 Looking at the logs, GoogleBot *did* have a session/cookie. In this case the session cookie is encoded in the URL. But it wasn't as though GoogleBot cajoled the server into giving it someone's session; it **showed up** with a valid session cookie *"already in hand"*. – Ian Boyd Oct 11 '13 at 13:57
  • Can you please post your web.config or post the section that contains the cookieless settings? – John Wu Oct 12 '13 at 02:29
  • Excellent question. I'll dig into the ASP.Net source to see what might trigger a cookieless session ID (perhaps disabling cookies is enough with certain settings?). Even if this comes down to an error on your part, I want to know what the error is so I can avoid it. – Tim M. Oct 12 '13 at 02:45

2 Answers


Though the question mainly references session identifiers, the length of the identifier struck me as unusual.

There are at least two types of cookie/cookieless operations that can modify the URL to include an ID.

  • Cookieless sessions
  • Cookieless forms authentication tokens

They are completely independent of each other (as far as I can tell).
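A quick way to see both effective settings (the merged result of machine.config and web.config) is to read them back out from inside the running application. This is a sketch; the helper name is made up.

// Sketch: dump the effective cookieless modes for session state and for
// forms authentication. Call this from a page or handler inside the app.
using System.Text;
using System.Web.Configuration;

static class CookieModeDiagnostics
{
    public static string Report()
    {
        var session = (SessionStateSection)WebConfigurationManager.GetSection("system.web/sessionState");
        var auth    = (AuthenticationSection)WebConfigurationManager.GetSection("system.web/authentication");

        var sb = new StringBuilder();
        sb.AppendLine("sessionState cookieless: " + session.Cookieless);    // session ID in the URL?
        sb.AppendLine("forms auth cookieless:   " + auth.Forms.Cookieless); // forms ticket in the URL?
        return sb.ToString();
    }
}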

Session State

A cookieless session allows the server to access session state data based on a unique ID in the URL versus a unique ID in a cookie. This is usually considered a fine practice, though ASP.Net reuses session IDs which makes it more prone to session fixation attempts (separate topic but worth knowing about).

Does session identity in ASP.net depend solely on the cookie? Can anyone, from any IP, with the cookie-url, access that session? Does ASP.net not, by default, also take the IP address into account?

The session ID is all that is required.


Forms Authentication

Based on the length of the example data, I'm guessing your URL actually contains a forms authentication value, not a session ID. The source code suggests that cookieless mode is not something you must explicitly enable.

/// <summary>ASP.NET determines whether to use cookies based on
/// <see cref="T:System.Web.HttpBrowserCapabilities" /> setting. 
/// If the setting indicates that the browser or device supports cookies, 
/// cookies are used; otherwise, an identifier is used in the query string.</summary>
UseDeviceProfile

Here's how the determination is made:

// System.Web.Security.CookielessHelperClass
internal static bool UseCookieless( HttpContext context, bool doRedirect, HttpCookieMode cookieMode )
{
    switch( cookieMode )
    {
        case HttpCookieMode.UseUri:
            return true;
        case HttpCookieMode.UseCookies:
            return false;
        case HttpCookieMode.AutoDetect:
            {
                // omitted for length
                return false;
            }
        case HttpCookieMode.UseDeviceProfile:
            if( context == null )
            {
                context = HttpContext.Current;
            }
            return context != null && ( !context.Request.Browser.Cookies || !context.Request.Browser.SupportsRedirectWithCookie );
        default:
            return false;
    }
}

Guess what the default is? HttpCookieMode.UseDeviceProfile. ASP.Net maintains a list of devices and their capabilities. This list is generally a very bad thing; for example, IE11 is falsely identified as a downlevel browser on par with Netscape 4.
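To see how ASP.Net has classified a particular caller (GoogleBot, a spoofed IE, and so on), something like the following can be dropped into a throwaway test page; the class name is arbitrary, and it only reports what the capability files decided.

// Sketch: report the browser capabilities ASP.Net resolved for this request.
// These are the same values UseDeviceProfile consults (see UseCookieless above).
using System;
using System.Web.UI;

public class BrowserCapsProbe : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        var caps = Request.Browser;   // HttpBrowserCapabilities, driven by the *.browser files
        Response.Write(caps.Browser + " " + caps.Version + "<br/>");
        Response.Write("Cookies: " + caps.Cookies + "<br/>");
        Response.Write("SupportsRedirectWithCookie: " + caps.SupportsRedirectWithCookie);
    }
}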

Causes

I think Gene's explanation is very likely; Google found the URL from some user action and crawled it.

It's completely conceivable that the Google bot is deemed to not support cookies. But this doesn't explain the origin of the URL, i.e. what user action resulted in Google seeing a URL with an ID already in it? A simple explanation could be a user with a browser that was deemed to not support cookies. Depending on the browser, everything else could look fine to the user.

The timing, i.e. the duration of validity seems long, though I'm not that familiar with how long the authentication tickets are valid and under what circumstances they could be renewed. It's entirely possible ASP.Net continued to reissue/renew tickets as it would do for a continually active user.

Possible Solutions

I'm making a lot of assumptions here, but if I'm correct:

  • First, reproduce the behavior in your environment.
  • Explicitly disable cookieless behavior by using HttpCookieMode.UseCookies.

    web.config:

     <authentication mode="Forms">
        <forms loginUrl="~/Account/Login.aspx" name=".ASPXFORMSAUTH" timeout="26297438"
               cookieless="UseCookies" />
     </authentication>
    

While this should resolve the behavior, you might investigate extending the forms authentication HTTP module and adding additional validation (or at least logging/diagnostics).
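For example, here is a sketch of such a module; the class name and the "(F(" marker check are assumptions of mine, not an established API, and the module would still need to be registered in the <modules> section of web.config.

// Sketch: reject (or just log) requests that authenticated via a cookieless
// forms-authentication ticket embedded in the URL.
using System;
using System.Web;

public class CookielessTicketGuard : IHttpModule
{
    public void Init(HttpApplication app)
    {
        app.PostAuthenticateRequest += (sender, e) =>
        {
            var application = (HttpApplication)sender;
            var context = application.Context;

            bool ticketInUrl = context.Request.RawUrl.Contains("/(F(");
            bool authenticated = context.User != null && context.User.Identity.IsAuthenticated;

            if (ticketInUrl && authenticated)
            {
                // This user was authenticated by a ticket carried in the URL.
                // Log it, and/or refuse to serve the request.
                context.Response.StatusCode = 403;
                application.CompleteRequest();
            }
        };
    }

    public void Dispose() { }
}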

Tim M.
  • Using Internet Explorer's `F12` tools, I set my **User-Agent** string to a known browser that doesn't support cookies. (The .NET database contains a useful `Generic Downlevel` user agent string that simulates this failure mode.) I logged into the customer's live, internet-facing web-site, and **was** given the *"cookie-in-url"* URL. I sent the long URL to a colleague. From his ("Generic Downlevel" configured) IE he was immediately logged in. Given that we have `cookieless=false`, this was maddening. Your insight into separate *session* vs *asp.net forms state* is probably the answer. – Ian Boyd Oct 12 '13 at 16:30
  • And that did it. There's [`<sessionState>`'s `cookieless`](http://msdn.microsoft.com/en-us/library/h6bb9cz9(v=vs.85).aspx), and there's [`FormsAuthentication.CookieMode`](http://msdn.microsoft.com/en-us/library/system.web.security.formsauthentication.cookiemode.aspx). One is off by default, the other is **not** off by default. And the one that's not off by default is the one that is important. – Ian Boyd Oct 12 '13 at 16:41

You asked for thoughts, so I'll give some. No warranty expressed or implied.

Give up the idea that your site is configured not to encode session information in URIs. With very high probability it did so. Either you're wrong about the configuration or (more likely) there's a bug that caused it to do so.

That leaves the central question: how did Google obtain the session URI?

You didn't say anything about the customer base. Here's a guess:

A customer logged into the system in a way that produced a URI encoding of the session, then emailed this using a gmail account to someone else. Google scanned the email and provided the URI to the crawler bot.

There are other, similar ways that a customer whose client produced the URI could inadvertently surrender it to Google. Google Drive document. Google Plus posting. Etc.

Google may not be evil, but they're nonetheless everywhere. Their use agreement lets them move links across product boundaries, in this case mail (etc.) to search.

The real question you should be thinking about is why your site is not protected from cross-site request forgery. The Rails folks explain this pretty nicely. The Rails protect_from_forgery mechanism would have prevented the reported problem.

A related question is why the encoded cookie (apparently) never expires. It ought to be easy to make sessions contain timestamps to make this so.
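For what it's worth, if Tim M.'s reading is right and the value in the URL is a forms-authentication ticket, it already carries timestamps. A server-side sketch to inspect one (the helper name is made up, and decryption only works on a server holding the issuing machine key):

// Sketch: decrypt a forms-authentication ticket and show when it was issued
// and when it expires; the expiry is governed by the <forms timeout=""> setting.
using System;
using System.Web.Security;

static class TicketInspector
{
    public static void Dump(string encryptedTicket)
    {
        FormsAuthenticationTicket ticket = FormsAuthentication.Decrypt(encryptedTicket);
        Console.WriteLine("user:    " + ticket.Name);
        Console.WriteLine("issued:  " + ticket.IssueDate);
        Console.WriteLine("expires: " + ticket.Expiration);
        Console.WriteLine("expired: " + ticket.Expired);
    }
}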

Gene
  • Wow. That concerns me about URLs crossing product boundaries. I was going to suggest installing [Google webmaster tools](http://www.google.com/webmasters/tools/) to track down how the crawler is getting referred to the site, but I guess that could result in more Google leakage. – Franklin P Strube Oct 17 '13 at 01:33