
I'm trying to restrict access to an S3 bucket, allowing only certain domains from a list based on the referer.

The bucket policy is basically:

{
    "Version": "2012-10-17",
    "Id": "http referer domain lock",
    "Statement": [
        {
            "Sid": "Allow get requests originating from specific domains",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example.com/*",
            "Condition": {
                "StringLike": {
                    "aws:Referer": [
                        "*othersite1.com/*",
                        "*othersite2.com/*",
                        "*othersite3.com/*"
                    ]
                }
            }
        }
    ]
}

These sites (othersite1, 2, and 3) call an object that I have stored in my S3 bucket under the domain example.com. I also have a CloudFront distribution attached to the bucket. I'm using a * wildcard before and after the string condition. The referer can be othersite1.com/folder/another-folder/page.html, and it may use either http or https.
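As a rough sanity check of the wildcard pattern, here's a small Node.js sketch (this only approximates IAM's StringLike matching, and the referer values are hypothetical):

// Approximation only: in StringLike, * matches any sequence of characters
// and ? matches a single character (matching is case-sensitive).
function stringLike(pattern, value) {
    const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, '\\$&')
                           .replace(/\*/g, '.*')
                           .replace(/\?/g, '.');
    return new RegExp('^' + escaped + '$').test(value);
}

// Hypothetical Referer values tested against the policy's "*othersite1.com/*" pattern:
console.log(stringLike('*othersite1.com/*', 'https://othersite1.com/folder/another-folder/page.html')); // true
console.log(stringLike('*othersite1.com/*', 'http://othersite1.com/'));                                 // true
console.log(stringLike('*othersite1.com/*', 'https://othersite1.com'));                                 // false (no trailing slash or path)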

I don't know why I'm getting a 403 Forbidden error.

I'm doing this basically because I don't want other sites to call that object.

Any help would be greatly appreciated.

esdrayker
  • I am not sure if leading wildcards are supported. I'd try removing them and test again. – Cagatay Gurturk Sep 02 '17 at 13:27
  • In the CloudFront cache behavior, did you configure the `Referer` header for whitelisting so that it is forwarded to S3? Any header not forwarded will not be usable by the bucket policy, which will be your first problem... though the issue is more complex than that, because your cache hit rate will not be as high if you forward the `Referer`. – Michael - sqlbot Sep 02 '17 at 14:26
  • @ÇağatayGürtürk thanks for the comment, I did try removing them, and even writing the exact referer as I see it in the Chrome inspector, but it's still not working – esdrayker Sep 02 '17 at 19:11
  • @Michael-sqlbot I did not configure the 'referer' header for whitelisting. I will do that! Thank you! Hopefully this will work – esdrayker Sep 02 '17 at 19:13
  • Note also that after changing that, you will want to [set the Error Caching TTL to 0 for 403 errors](https://stackoverflow.com/a/35541525/1695906) and then do an invalidation of `/*` (all objects), and then let everything settle for a few minutes before testing again, or you will likely get cached 403 responses, which can be frustrating if you've actually fixed the problem but testing still fails for a few minutes. An `Age:` response header tells you how long (in seconds) a particular response has been in the cache, and will be absent if a response was not cached. – Michael - sqlbot Sep 02 '17 at 20:40
  • @Michael I did exactly that and it worked beautifully. I was having those kinds of issues when testing before, even after creating the invalidations. Thank you for the answer. I don't know how to settle this, but yours was the solution. – esdrayker Sep 03 '17 at 01:16
  • I'll write up an answer and give you another option, as well. – Michael - sqlbot Sep 03 '17 at 01:29
  • @Michael about what you said that the issue is more complex because cache hits will not be as high, I understand why. My question is, the object gets called by these domains only. This is just to prevent some other site taking advantage of this object, which would mess up our metrics and eventually have a costly impact. If the domains are constantly calling the object, and we keep all flavors on the edge, can we eventually increase performance? Thanks and sorry for the long question. – esdrayker Sep 03 '17 at 01:29
  • It will be a few hours before I am able to provide a full answer, but it will address this question. The issue is that referer is the referring *page*, not the referring *domain*, so CloudFront will be caching a "different" copy of your objects for each unique page that requests the object. This doesn't come with a direct cost, since CloudFront doesn't charge for storage, but it reduces the likelihood of cache hits, at least to some extent. More to come, after I have an opportunity to assemble all the details. – Michael - sqlbot Sep 03 '17 at 03:46

1 Answer


As is necessary for correct caching behavior, CloudFront strips almost all of the request headers off of a request before forwarding it to the origin server.

> Referer | CloudFront removes the header.

http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/RequestAndResponseBehaviorCustomOrigin.html#request-custom-headers-behavior

So, if your bucket is trying to block requests based on the referring page, as is sometimes done to prevent hotlinking, S3 will not -- by default -- be able to see the Referer header, because CloudFront doesn't forward it.

And, this is a very good illustration of why CloudFront doesn't forward it. If CloudFront forwarded the header and then blindly cached the result, whether the bucket policy had the intended effect would depend on whether the first request was from one of the intended sites, or from elsewhere -- and other requesters would get the cached response, which might be the wrong response.

(tl;dr) Whitelisting the Referer header for forwarding to the origin (in the CloudFront Cache Behavior settings) solves this issue.
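For reference, if you are editing the distribution through the API or SDK rather than the console, the piece of the cache behavior involved is the ForwardedValues.Headers whitelist. A minimal sketch of that fragment only (the QueryString and Cookies values here are placeholders for whatever your behavior already uses):

// Sketch of the relevant cache behavior fragment as it appears in a CloudFront
// distribution config. Only ForwardedValues is shown; everything else in your
// behavior stays as it is.
const cacheBehaviorFragment = {
    ForwardedValues: {
        QueryString: false,              // placeholder -- keep your existing setting
        Cookies: { Forward: 'none' },    // placeholder -- keep your existing setting
        Headers: {
            Quantity: 1,
            Items: [ 'Referer' ]         // whitelist Referer so S3 can see it in the bucket policy
        }
    }
};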

But, there is a bit of a catch.

Now that you are forwarding the Referer header to S3, you've extended the cache key -- the list of things against which CloudFront caches responses -- to include the Referer header.

So, now, for each object, CloudFront will not serve a response from the cache unless the incoming request's Referer header exactly matches one from an already-cached request... otherwise the request has to go to S3. And the thing about the Referer header is that it identifies the referring page, not the referring site, so each page from the authorized sites will have its own cached copy of these assets in CloudFront.

This, itself, is not a problem. There is no charge for these extra copies of objects, and this is how CloudFront is designed to work... the problem is, it reduces the likelihood of a given object being in a given edge cache, since each object will necessarily be referenced less. This becomes less significant -- to the point of insignificance -- if you have a large amount of traffic, and more significant if your traffic is smaller. Fewer cache hits means slower page loads and more requests going to S3.
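To make that concrete, here's a conceptual sketch -- not CloudFront's actual internals -- of what the extended cache key means in practice:

// Conceptual illustration only. With Referer whitelisted, the same object
// requested from two different referring pages is cached (and fetched from S3)
// separately, because the Referer value is now part of the effective cache key.
function effectiveCacheKey(uri, refererValue) {
    return uri + '|referer=' + refererValue;
}

console.log(effectiveCacheKey('/logo.png', 'https://othersite1.com/page-a.html'));
// "/logo.png|referer=https://othersite1.com/page-a.html"

console.log(effectiveCacheKey('/logo.png', 'https://othersite1.com/page-b.html'));
// "/logo.png|referer=https://othersite1.com/page-b.html" ... a separate cache entry for the same object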

There is not a correct answer to whether or not this is ideal for you, because it is very specific to exactly how you are using CloudFront and S3.

But, here's the alternative:

You can remove the Referer header from the whitelist of headers to forward to S3 and undo that potential for negatively impacting cache hits, by configuring CloudFront to fire a Lambda@Edge Viewer Request trigger that will inspect each request as it comes in the front door, and block those requests that don't come from referring pages that you want to allow.

A Viewer Request trigger fires after the specific Cache Behavior is matched, but before the actual cache is checked, and with most of the incoming headers still intact. You can allow the request to proceed, optionally with modifications, or you can generate a response and cancel the rest of the CloudFront processing. That's what I'm illustrating, below -- if the host part of the Referer header isn't in the array of acceptable values, we generate a 403 response; otherwise, the request continues, the cache is checked, and the origin consulted only as needed.

Firing this trigger adds a small amount of overhead to every request, but that overhead may amortize out to being more desirable than a reduced cache hit rate. So, the following is not a "better" solution -- just an alternate solution.

This is a Lambda function written in Node.js 6.10.

'use strict';

const allow_empty_referer = true;

const allowed_referers = ['example.com', 'example.net'];

exports.handler = (event, context, callback) => {

    // extract the original request, and the headers from the request
    const request = event.Records[0].cf.request;
    const headers = request.headers;

    // find the first referer header if present, and extract its value;
    // then take http[s]://<--this-part-->/only/not/the/path.
    // the || [])[0]) || {'value' : ''} construct is optimizing away some if(){ if(){ if(){ } } } validation

    const referer_host = (((headers.referer || [])[0]) || {'value' : ''})['value'].split('/')[2];

    // compare to the list, and immediately allow the request to proceed through CloudFront 
    // if we find a match

    for(var i = allowed_referers.length; i--;)
    {
        if(referer_host == allowed_referers[i])
        {
            return callback(null,request);
        }
    }

    // also test for a missing referer header, if we allowed that, above
    // (when there is no referer, referer_host comes back undefined, not "")
    // usually, you do want to allow this

    if(allow_empty_referer && referer_host === undefined)
    {
        return callback(null,request);
    }

    // we did not find a reason to allow the request, so we deny it.

    const response = {
        status: '403',
        statusDescription: 'Forbidden',
        headers: {
            'vary':          [{ key: 'Vary',          value: '*' }], // hint, but not too obvious
            'cache-control': [{ key: 'Cache-Control', value: 'max-age=60' }], // browser-caching timer
            'content-type':  [{ key: 'Content-Type',  value: 'text/plain' }], // can't return binary (yet?)
        },
        body: 'Access Denied\n',
    };

    callback(null, response);
};
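For completeness, attaching the trigger: the function has to be created in us-east-1 and published as a numbered version ($LATEST can't be used with Lambda@Edge), and it is then referenced from the cache behavior as a viewer-request association. A rough sketch of the relevant distribution config fragment (the ARN below is a placeholder, not a real account or function):

// Sketch of the cache behavior fragment that attaches the viewer-request trigger.
const lambdaAssociationFragment = {
    LambdaFunctionAssociations: {
        Quantity: 1,
        Items: [
            {
                EventType: 'viewer-request',
                LambdaFunctionARN: 'arn:aws:lambda:us-east-1:123456789012:function:referer-check:1' // placeholder
            }
        ]
    }
};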
Michael - sqlbot
  • Thank you @Michael, this is very helpful. As you said, the solution depends on the use case. We do have a lot of traffic, but we also have lots of different referers. So now we have to look at the tradeoffs between cost and performance. Looking at Lambda prices, it seems like a good option even for high traffic, but it does add a little overhead. On the other hand, the bucket policy seems less scalable if the objective is to keep adding domains, growing the policy too complex. It is very interesting indeed. Thanks again! – esdrayker Sep 05 '17 at 23:09