4

Maybe this is trivial, but I haven't found anything meaningful or I didn't know where to look...

(How) is it possible to send a curl / whatever command as soon as a certain path is requested?

Something along these lines, but that would actually work:

location / {
    curl --data 'v=1&t=pageview&tid=UA-XXXXXXXX-X&cid=123&dp=hit' https://google-analytics.com/collect
}
lucian
    I don't know (how) if this can be done with "pure" nginx, but can give you a recipe on how to do this with [OpenResty](https://github.com/openresty/openresty) (or [ngx_http_lua_module](https://github.com/openresty/lua-nginx-module)) if this is an option for you. – Ivan Shatsky Dec 30 '18 at 04:49
  • If it gets the job done, why not – lucian Jan 01 '19 at 05:14

4 Answers

6

As pointed out in the comments, ngx_http_lua_module can do it!

location / {
    access_by_lua_block {
        os.execute("/usr/bin/curl --data 'v=1&t=pageview&tid=UA-XXXXXXXX-X&cid=123&dp=hit' https://google-analytics.com/collect >/dev/null 2>/dev/null")
    }
}

Note that execution blocks the page load until curl has finished. To run curl in the background and let the page load continue immediately, add a space and an & at the end of the command, so it ends like

>/dev/null 2>/dev/null &")
hanshenrik
  • Yep, it's working! Had to install openresty and no more http/2 support for now - hopefully they'll release a version based on nginx >1.13.9 soon... Any way to pass existing headers as parameters into that? – lucian Jan 01 '19 at 16:40
  • @LucianDavidescu existing headers? what do you mean, request headers sent from the client/browser to nginx? or do you mean headers that nginx send back to the client/browser? what kind of data do you want to add to curl? – hanshenrik Jan 02 '19 at 11:48
  • The request headers mainly, but if some of the response headers can be added, it would be even greater. – lucian Jan 02 '19 at 13:40
  • @LucianDavidescu can you provide some additional info on how do you want to use request/response headers? Mentioning response headers means that you use nginx/openresty as a reverse proxy for some backend? And I don't have any problems with HTTP/2 and latest stable version of openresty, what version of OpenSSL do you have at your server? Did you build openresty from source or install it from some repository? Is `http_v2_module` mentioned in `nginx -V` output? – Ivan Shatsky Jan 04 '19 at 12:51
  • 3
    @LucianDavidescu I honestly can't believe this solution is even considered to be acceptable even for "testing" purposes, let alone any sort of production environment. Spawning a new process in the background, whilst immediately returning back to the client, makes it trivial for a single client using a single TCP connection to completely bring down your whole machine in a matter of seconds, yes, your whole machine, through a trivial exhaustion and overload of the process table. The solution proposed in this answer is hardly different from what would be a *forkbomb*! – cnst Jan 04 '19 at 15:53
  • @IvanShatsky - sorry, what I meant to say was no push. The response headers would be set from the wordpress API to specify some extra relevant data about the content served. – lucian Jan 04 '19 at 21:52
  • @cnst - you mean an attacker (aware or not of the exact implementation) or just a random client doing random stuff? – lucian Jan 04 '19 at 21:57
  • 2
    @LucianDavidescu, it could be anything. What if DNS is down, or Google decides to throttle you, or IPv6 gets configured, but doesn't work? Each curl instance would persist for 3+ minutes, with more coming each request. You wouldn't be able to login into a system with shell, because process table is exhausted. Your best bet would be if you're using shared hosting, and/or fork is just slow (and they really are), and are limited to 20 to 100 forks a second, which takes 2/3rd of your CPU power, slowing down the rest of your site. I don't think you fully realise just how expensive forks are. – cnst Jan 04 '19 at 22:42
  • @cnst but couldn't a process similar to the one that writes the log entry send the request? – lucian Jan 05 '19 at 13:31
  • @LucianDavidescu, yes — see my answer. – cnst Jan 05 '19 at 17:14
  • 2
@LucianDavidescu seems you can get an array of request headers (sent by the browser) by running `local headers, err = ngx.req.get_headers();`, and get an array of response headers (sent by nginx) by using `local headers, err = ngx.resp.get_headers()` - but you should probably use `log_by_lua_block` instead of `access_by_lua_block` – hanshenrik Jan 06 '19 at 08:21
  • 1
    Found this - https://github.com/vorodevops/nginx-analytics-measurement-protocol/tree/master/lua it uses proxy_pass, works quite nice so far. – lucian Jan 07 '19 at 12:35
5

What you're trying to do (executing a new curl instance for Google Analytics on each URL request to your server) is the wrong approach to the problem:

  1. Nginx itself is easily capable of servicing 10k+ concurrent connections at any given time; that's a lower bound, provided you do things right, see https://en.wikipedia.org/wiki/C10k_problem.

  2. On the other hand, fork, the underlying system call that creates a new process (which would be needed in order to run curl for each request), is very slow: on the order of 1k forks per second as an upper bound, i.e., that's the fastest it'll ever go, see Faster forking of large processes on Linux?.


What's the best alternative solution with better architecture?

  • My recommendation would be to do this through batch processing. You're not really gaining anything by doing Google Analytics in real time, and a 5-minute delay in statistics should be more than adequate. You could write a simple script in a programming language of your choice that looks through the relevant access_log (see http://nginx.org/r/access_log), collects the data for the required time period, and makes a single batch request (and/or multiple individual requests from within a single process) to Google Analytics with the requisite information about each visitor in the last 5 minutes. You can run this as a daemon process, or as a script from a cron job, see crontab(5) and crontab(1).

  • Alternatively, if you still want real-time processing for Google Analytics (which I don't recommend, because most of these services themselves are implemented on an eventual consistency basis, meaning, GA itself wouldn't necessarily guarantee accurate real-time statistics for the last XX seconds/minutes/hours/etc), then you might want to implement a daemon of some sort to handle statistics in real time:

    • My suggestion would still be to utilise access_log in such daemon, for example, through a tail -f /var/www/logs/access_log equivalent in your favourite programming language, where you'd be opening the access_log file as a stream, and processing data as it comes and when it comes.

    • Alternatively, you could implement this daemon to have an HTTP request interface itself, and duplicate each incoming request to both your actual backend, as well as this extra server. You could multiplex this through nginx with the help of the not-built-by-default auth_request or add_after_body to make a "free" subrequest for each request. This subrequest would go to your server, for example, written in Go. The server would have at least two goroutines: one would process incoming requests into a queue (implemented through a buffered string channel), immediately issuing a reply to the client, to make sure to not delay nginx upstream; another one would receive the requests from the first one through the chan string from the first, processing them as it goes and sending appropriate requests to Google Analytics.
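The batch-processing idea can be illustrated with a rough, hypothetical sketch (my own code, not the answer's): parse nginx's combined-format access_log and turn each line into a Measurement Protocol payload, to be flushed periodically to the GA /batch endpoint. The regex and the `hits_from_log` name are assumptions for illustration.

```python
import re
from urllib.parse import urlencode

# Matches the start of nginx's default "combined" log format:
# <ip> - - [<time>] "<method> <path> <proto>" <status> ...
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+)'
)

def hits_from_log(lines, tid="UA-XXXXXXXX-X"):
    """Turn raw access_log lines into GA 'pageview' payload strings."""
    payloads = []
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that don't parse
        payloads.append(urlencode({
            "v": 1, "t": "pageview", "tid": tid,
            "cid": m.group("ip"),   # crude client id; a hash would be better
            "uip": m.group("ip"),
            "dp": m.group("path"),
        }))
    return payloads

# The /batch endpoint accepts up to 20 newline-separated hits per request,
# so a cron job could post them in chunks, e.g. with the requests library:
# requests.post("https://www.google-analytics.com/batch",
#               data="\n".join(payloads[:20]))
```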

Ultimately, whichever way you go, you'd probably still want to implement some level of batching and/or throttling, because at some point Google Analytics itself would likely throttle you if you keep sending it requests from the same IP address at a very excessive rate without any sort of batch implementation in place. As per What is the rate limit for direct use of the Google Analytics Measurement Protocol API? as well as https://developers.google.com/analytics/devguides/collection/protocol/v1/limits-quotas, it would appear that most libraries implement internal limits on how many requests per second they'll send to Google.
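If you do keep a real-time sender, the client-side throttling mentioned above can be as simple as a token bucket. A minimal sketch (the `TokenBucket` name is hypothetical, and the rate/capacity values should follow Google's published quotas):

```python
import time

class TokenBucket:
    """Client-side throttle: allow at most `rate` hits/second on average,
    with bursts of up to `capacity` hits."""
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.clock = clock        # injectable for testing
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should drop or queue the hit
```

A sender loop would then call `allow()` before each hit and queue (or drop) the hit when it returns False.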

cnst
  • Indeed, that seems to be the scalable long-term solution. However, on the one hand the simultaneous-connection and rate limits are quite high for most use cases anyway (unless there are also performance issues), while on the other hand I think a "quick and dirty" approach may come in handy at least for testing purposes. – lucian Jan 04 '19 at 07:57
  • 1
    btw i wrote some code to parse nginx access logs in PHP, see line 58 here https://github.com/divinity76/http_log_parser/blob/master/create_database.php#L58 (but that code is from 2015 and unmaintained, idk if there's been any changes since 2015) – hanshenrik Jan 04 '19 at 08:03
3

If all you need is to submit a hit to Google Analytics, it can be accomplished more simply: Nginx can modify page HTML on the fly, embedding the GA code before the closing </body> tag:

sub_filter_once on;

sub_filter '</body>' "<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','https://www.google-analytics.com/analytics.js','ga');

ga('create', 'UA-XXXXXXXX-X', 'auto');
ga('send', 'pageview');
</script></body>";

location / {
}

The Nginx module used here is ngx_http_sub_module.

Alexander Azarov
  • what happens to non-html files that happen to contain the phrase `</body>`? for example XML files? – hanshenrik Jan 07 '19 at 17:15
  • 1
    @hanshenrik it's configurable. The default configuration is to replace in the files with `text/html` MIME type and it's possible to permit in others. – Alexander Azarov Jan 07 '19 at 17:47
  • It's not that I don't have access to the site to put the javascript there in the first place if that's what I wanted to do... – lucian Jan 07 '19 at 18:14
1

Here's how I did it eventually - proxy_pass instead of curl - based on this: https://github.com/vorodevops/nginx-analytics-measurement-protocol/tree/master/lua. The code assumes openresty, or nginx with the Lua module installed. Comments inside the access_by_lua_block use Lua's -- syntax; comments in the surrounding nginx config use #.

# pick your location

location /example {

    # invite lua to the party

    access_by_lua_block {

        -- set request parameters

        local request = {
            v = 1,
            t = "pageview",

            -- don't forget to put your own property here

            tid = "UA-XXXXXXX-Y",

            -- a "unique" user id based on a hash of ip and user agent; not too
            -- reliable, but possibly the best one can reasonably do without cookies
            -- (the "or ''" guards avoid errors when a header is absent)

            cid = ngx.md5(ngx.var.remote_addr .. (ngx.var.http_user_agent or "")),
            uip = ngx.var.remote_addr,
            dp = ngx.var.request_uri,
            dr = ngx.var.http_referer,
            ua = ngx.var.http_user_agent,

            -- truncate the language string to match the javascript format:
            -- keep either the first two characters like here (e.g. en) or the
            -- first five (e.g. en_US) with ...1, 5

            ul = string.sub(ngx.var.http_accept_language or "", 1, 2)
        }

        -- use ngx.location.capture to send everything to an internal proxy location

        local res = ngx.location.capture("/gamp", {
            method = ngx.HTTP_POST,
            body = ngx.encode_args(request)
        })
    }
}


# make a separate location block to proxy the request away

location = /gamp {
    internal;
    expires epoch;
    access_log off;
    proxy_pass_request_headers off;
    proxy_pass_request_body on;
    proxy_pass https://google-analytics.com/collect;
}
lucian