
I am not sure if this exactly qualifies for Stack Overflow, but since I need to do this programmatically, and I figure lots of people on SO use CloudFront, I think it does... so here goes:

I want to hide public access to my custom origin server.

CloudFront pulls from the custom origin, but I cannot find documentation or any example of preventing direct requests from users to my origin when it is proxied behind CloudFront, unless the origin is S3... which, by definition, a custom origin is not.

What technique can I use to identify/authenticate that a request is being proxied through CloudFront instead of being directly requested by the client?

The CloudFront documentation only covers this case when used with an S3 origin. The AWS forum post that lists CloudFront's IP addresses has a disclaimer that the list is not guaranteed to be current and should not be relied upon. See https://forums.aws.amazon.com/ann.jspa?annID=910

I assume that anyone using CloudFront has some sort of way to hide their custom origin from direct requests / crawlers. I would appreciate any sort of tip to get me started. Thanks.

user319862

3 Answers


I would suggest using something similar to Facebook's robots.txt to prevent crawlers from accessing sensitive content on your website.

https://www.facebook.com/robots.txt (you may have to tweak it a bit)
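
For instance, a minimal robots.txt on the origin could simply deny everything. Keep in mind this only deters well-behaved crawlers; it does nothing against direct requests:

    # Deny all well-behaved crawlers access to the whole origin.
    # This does not stop a user or script from fetching files directly.
    User-agent: *
    Disallow: /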

After that, just point your app (e.g. Rails) to be the custom origin server.

Now rewrite all the URLs on your site to be absolute URLs, like:

https://d2d3cu3tt4cei5.cloudfront.net/hello.html

Basically, all URLs should point to your CloudFront distribution. Now if someone requests https://d2d3cu3tt4cei5.cloudfront.net/hello.html and CloudFront does not have hello.html cached, it can fetch it from your server (over an encrypted channel like HTTPS) and then serve it to the user.

So even if the user views the page source, they only learn your CloudFront distribution, not your origin server.

More details on setting this up here:

http://blog.codeship.io/2012/05/18/Assets-Sprites-CDN.html
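
As a minimal sketch for a Rails app (assuming the distribution hostname used above), the rewriting can be done with one setting; Rails asset helpers such as image_tag and stylesheet_link_tag will then generate absolute URLs on the CloudFront domain:

    # config/environments/production.rb
    # Make asset helpers emit absolute URLs pointing at the CloudFront
    # distribution instead of relative URLs pointing at the origin.
    config.action_controller.asset_host = "https://d2d3cu3tt4cei5.cloudfront.net"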

sambehera
  • Thank you for your response. This doesn't really solve the problem in my question though because the origin would still serve direct requests instead of redirecting them to the CDN. – user319862 Jan 08 '13 at 23:27
  • CloudFront basically serves as a cache for your site. If all the links in your HTML files point to resources on your CloudFront domain, the only requests your web server has to serve directly are for pages like domain.com/index.html. People who view your source can still reverse-engineer the CloudFront URLs in the page and then manually request, say, www.domain.com/index2.html from the origin, but only human users who want to peek at the source code and increase your server load can do that, not search robots. – sambehera Jan 09 '13 at 00:02
  • Another thing you can do is whitelist connections to your server so that only Amazon's hostnames are accepted. Then NO ONE except Amazon CloudFront can access any file on your server directly (except, say, home.html). This is server specific; Apache would have a different configuration than something like Thin or Unicorn. A sketch of the Apache variant follows these comments. – sambehera Jan 09 '13 at 00:04
  • You might find this useful > http://docs.amazonwebservices.com/AmazonCloudFront/latest/DeveloperGuide/PrivateContent.html – sambehera Jan 09 '13 at 02:49
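
A minimal sketch of the whitelist idea from the comments, for Apache 2.4. The big assumption, which the question itself flags, is that CloudFront edge addresses keep reverse-resolving under cloudfront.net; AWS does not guarantee that, so treat this as best-effort:

    # Apache 2.4: allow only clients whose double reverse DNS lookup ends
    # in cloudfront.net. /var/www/site is a placeholder for your document
    # root. Each check costs a DNS lookup, and AWS does not promise that
    # edge IPs will keep resolving this way.
    <Directory "/var/www/site">
        Require host cloudfront.net
    </Directory>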

Create a custom CNAME that only CloudFront uses. On your own servers, block any request for static assets that doesn't arrive via that CNAME.

For instance, if your site is http://abc.mydomain.net, then set up a CNAME, http://xyz.mydomain.net, that points to the exact same place, and put that new domain in CloudFront as the origin pull server. Then, on each request, you can tell whether it came through CloudFront or not and do whatever you want.

The downside is that this is security through obscurity. The client never sees the requests for http://xyz.mydomain.net, but that doesn't mean they won't have some way of figuring it out.
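
A sketch of the blocking side with Apache mod_rewrite, in virtual-host context; xyz.mydomain.net and /static/ are placeholders for the CloudFront-only CNAME and the asset path:

    # Return 403 for any static-asset request whose Host header is not
    # the CNAME reserved for CloudFront.
    RewriteEngine On
    RewriteCond %{HTTP_HOST} !^xyz\.mydomain\.net$ [NC]
    RewriteRule ^/static/ - [F]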

John Q

[I know this thread is old, but I'm answering it for people like me who see it months later.]

From what I've read and seen, CloudFront does not consistently identify itself in requests. But you can get around this problem by overriding robots.txt at the CloudFront distribution.

1) Create a new S3 bucket that only contains one file: robots.txt. That will be the robots.txt for your CloudFront domain.

2) Go to your distribution settings in the AWS Console and click Create Origin. Add the bucket.

3) Go to Behaviors and click Create Behavior, with Path Pattern: robots.txt and Origin: (your new bucket).

4) Set the robots.txt behavior at a higher precedence (lower number).

5) Go to Invalidations and invalidate /robots.txt.

Now abc123.cloudfront.net/robots.txt will be served from the bucket and everything else will be served from your domain. You can choose to allow/disallow crawling at either level independently.
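
For example, to let crawlers index your real domain while keeping them off the raw distribution hostname, the bucket's robots.txt could simply be:

    # Served only for requests to abc123.cloudfront.net (via the behavior
    # above); the origin's own robots.txt still governs your real domain.
    User-agent: *
    Disallow: /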

Another domain/subdomain will also work in place of a bucket, but why go to the trouble?

Luke Lambert