User agent header - abbreviation for mysql storing

Question

According to this thread, and especially this post: https://stackoverflow.com/a/6595973/1125465, Microsoft, as always, shows off. A user agent string can be really, really huge.

I'm working on a little visitors library in PHP, and I want to store user agent information. I cannot decide on the data type and length.

So my question is: do you have any ideas on how to shorten the user agent to some "normal" size (for example, 256 chars)?

Note: Developers use user agents to detect the user's browser and operating system. So, according to the linked example, all the stupid numbers from M$ are just... Just are. As always, getting on our nerves. So the idea is to make a function that shortens the user agent string without losing the important information.

I think that such a function should:

Not depend on future updates and new browsers (no hardcoded strings).
Have a simple mechanism that decides what to delete (for example, a run of number, comma, number, comma, number, ... is not interesting, so it can be deleted).
Finally, if the user agent is still too long after all these operations (let's say over 256 chars), there is nothing more to do, so just cut off the rest. This is a one-in-a-million case, so that data can be lost.

Additional note: I know that I can make a function that gets the browser and OS type from the user agent and saves only these values. But such functions always have hardcoded names, and if the browser isn't recognized, they return, for example, "Unrecognized browser". So in the future everyone must remember to update these functions. If we save a shortened user agent instead, the information isn't lost (only the script that reads the database needs a new recognition system), and the entries in the database stay reliable and consistent, as they should be.

UPDATE: As there should be some code, and the problem is with the idea rather than with existing code, here is the minimal code I have written so far ;) :

<?php
    function shorten($useragent, $maxsize = 256) {
        $shorten = $useragent;
        // TODO: the actual shortening logic goes here (this is what the question asks for)
        $shorten = substr($shorten, 0, $maxsize); // the "last hope" cut
        return $shorten;
    }
    echo shorten($_SERVER['HTTP_USER_AGENT'] ?? ''); // the header may be missing
?>

You could extract certain information from it by using `preg_replace()` and just store what you need. — ThePixelPony, May 19 '14 at 16:46

score 6 · Accepted Answer · answered Jun 20 '14 at 15:52

There are no rules that User-Agent strings reliably follow in practice, so there is no way to create a completely correct and future-proof parser. There is a general pattern though:

User-Agent: <engine-string> <engine-string> ...

Where engine-string has form:

<agent-name> (<comment>; <comment>; ...)

Each engine string (that is just my own name for it, based on my understanding, and it may not be the correct term) may or may not have comments.

For example:

Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) ↲
AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e ↲
Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

(This is a single string, I just broke it into lines.) It seems that whenever someone forks a browser engine, they just append their own thing to the end. So we have some abstract "Mozilla" browser (a legacy of the "First Browser War") which thinks it's on an iPhone. Then we see that there is a WebKit (which remembers that it was born as KHTML a long time ago). Then there is some Version/6.0 modification, which was then modified into Mobile/10A5376e, which became Safari/8536.25, which finally reveals the secret that it is actually a mobile Google bot.

Another example:

Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.4; ↲
InfoPath.1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.4506.2152; ↲
.NET CLR 3.5.30729; .NET CLR 1.1.4322)

This is a single engine, but it has much to say in parentheses.
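
As a rough illustration of the pattern above, here is a minimal sketch of a tokenizer. The function name `parse_user_agent` and the returned array shape are my own assumptions, not an established API, and the regular expression only handles flat (non-nested) parentheses, so unusual strings can produce odd tokens.

```php
<?php
/**
 * Hypothetical helper: split a User-Agent string into engine sections,
 * each with a name and a list of parenthesized comments.
 */
function parse_user_agent(string $useragent): array
{
    // One product token (no whitespace or parentheses), optionally followed
    // by a single "(...)" group without nested parentheses.
    $pattern = '/([^\s()]+)(?:\s*\(([^()]*)\))?/';
    preg_match_all($pattern, $useragent, $matches, PREG_SET_ORDER);

    $engines = [];
    foreach ($matches as $m) {
        $comments = [];
        if (isset($m[2]) && trim($m[2]) !== '') {
            // Comments inside the parentheses are separated by semicolons.
            $comments = array_map('trim', explode(';', $m[2]));
        }
        $engines[] = ['name' => $m[1], 'comments' => $comments];
    }
    return $engines;
}

// Usage: dump the sections found in a shortened version of the first example.
$ua = 'Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko)';
print_r(parse_user_agent($ua));
```

Each element of the result has a `name` (such as `Mozilla/5.0`) and a `comments` array, which is empty for sections like `Version/6.0` that carry no parentheses.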

So the general observation is:

the last engine strings are the most important,
the last comments in parentheses are less important.

Having that in mind, my idea would be to parse the string into these engine and comment tokens, then from each engine section throw away comments starting from, say, the fifth. Then, if it is still not enough, throw away engine sections starting from the second (the first is often an abstract "Mozilla", but often has useful comments; also sometimes it is actually something concrete, especially for web crawlers).

When parsing, we need to take into account that occasionally there may be strings that do not follow this format. They can be saved to a log file for later inspection and then simply cut to the needed length to fit into the database.
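
The approach described above could be sketched roughly as follows. This is only an illustration under my own assumptions: the function name `shorten_user_agent`, the `$maxComments` parameter (keep four comments, drop from the fifth), and the choice to always keep the first and last sections are not fixed by anything above. The logging of unusual strings is left out; strings that cannot be tokenized are just cut.

```php
<?php
// Sketch only: the limits ($maxsize, $maxComments) are illustrative assumptions.
function shorten_user_agent(string $useragent, int $maxsize = 256, int $maxComments = 4): string
{
    if (strlen($useragent) <= $maxsize) {
        return $useragent; // short enough, store it unchanged
    }

    // Tokenize into "name (comment; comment; ...)" sections (flat parentheses only).
    preg_match_all('/([^\s()]+)(?:\s*\(([^()]*)\))?/', $useragent, $matches, PREG_SET_ORDER);

    $sections = [];
    foreach ($matches as $m) {
        $section = $m[1];
        if (isset($m[2]) && trim($m[2]) !== '') {
            // Step 1: keep only the first $maxComments comments of each section.
            $comments = array_map('trim', explode(';', $m[2]));
            $comments = array_slice($comments, 0, $maxComments);
            $section .= ' (' . implode('; ', $comments) . ')';
        }
        $sections[] = $section;
    }

    if ($sections === []) {
        // Nothing recognizable was found: fall back to the plain cut.
        return substr($useragent, 0, $maxsize);
    }

    // Step 2: while still too long, drop sections starting from the second,
    // so that the first section and the last ones survive the longest.
    while (count($sections) > 2 && strlen(implode(' ', $sections)) > $maxsize) {
        array_splice($sections, 1, 1);
    }

    // Step 3: the "last hope" cut for anything that still does not fit.
    return substr(implode(' ', $sections), 0, $maxsize);
}

// Usage: shorten the current request's user agent (empty string if the header is missing).
echo shorten_user_agent($_SERVER['HTTP_USER_AGENT'] ?? '');
```

Strings that already fit are returned untouched, so the database keeps the original value in the common case and only the rare oversized strings are rewritten.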

Thank you very much for your answer. It is the only one, but I'm quite sure no one will cover this subject more broadly. +1, accepted. Best regards, and I really appreciate your work and knowledge. Thanks! — Jacek Kowalewski, Jun 20 '14 at 18:49
