1

I have a dynamic PHP web app which gets input params in the url (no surprise here). However, bingbot sometimes requests etremely long URLs from the site. E.g. > 10000 characters long urls. One of the inputs is an UTF name and bingbot somehow submits sketchy input names, thousands of characters long like this: \xc2\x83\xc3\x86... (goes on for thousands of characters).

Obviously, it gets a 404, because there is no such name in the database (and therefore no such page), but it occurred to me whether it's worth it to check the input length before querying the db (e.g. a name cannot be more than 100 characters long) and return a 404 instantly if it's too long. Is it standard practice? Or it's not worth the trouble, because the db handles it?

I'm thinking of not putting extra load on the db unnecessarily. Is this long input submitted as is by the db client interface (two calls: first a prepare for sanitizing the input and then the actual query) or the php db client knows the column size and truncates the input string before sending it down the wire?

Tom
  • 7,515
  • 7
  • 38
  • 54
  • Since we're dealing with SELECT statements, the only improvement you'll see from doing the check is less db traffic. Whether or not it's worth doing will depend on the resources you and your servers have available. – Kevin B Oct 09 '14 at 19:54
  • @Kevin B since I'm on a shared host, it may be a good thing – Tom Oct 09 '14 at 19:55
  • `if (strlen($badurl) > $limit)` will definitely be cheaper than doing a full-blown round trip through the database, when you already can trivially tell there won't be any matches. – Marc B Oct 09 '14 at 19:59
  • limiting what the user can request client side (max form input) while can be easily defeated if some one really wanted to, is a cheap (free) approach –  Oct 09 '14 at 20:00
  • It is worth to validate the string and then proceed to database to query if succeeded. Hackers will take advantage of sending unnecessary characters and get the valuable information from organizations. Every service call inputs should be validated and send respond back with proper message. – Anvesh Vejandla Oct 09 '14 at 20:01
  • But is it standard practice? Should it be done for every kind of string input? Seems a bit superfluous to put in this check everywhere. Do people usually do it? – Tom Oct 09 '14 at 20:02
  • @Tom Validate EVERY input. ALWAYS. I don't know if people *usually do* it, but I know everyone *should* always do it. – ItalyPaleAle Oct 09 '14 at 20:03
  • 1
    ask 10 programmers what is "standard practice" get 11 answers –  Oct 09 '14 at 20:04
  • @Qualcuno yes, input should be validated (prepared statements, etc.). But should "input length" be validated explicitly too? I guess it is usually left to the database, because it does it anyway and it's not worth to duplicate length checks. – Tom Oct 09 '14 at 20:13
  • 2
    @Tom prepared statements do not validate the input. They just protect against SQL injection, but they accept any input. For example, they don't protect against XSS (read the addendum to my answer). Validating the length is not mandatory, but it doesn't hurt. Eventually, if you're expecting an input of 10 characters and somebody sends you 10,000 characters, it means that he or she does not have good intentions. No exception! – ItalyPaleAle Oct 09 '14 at 20:23

2 Answers2

1

Not only what you're asking is more than legit, but I'd say it's something that you should be doing as part of the input filtering/validation. If you expect your input to be always shorter than 100 characters, everything that's longer should be filtered.
Also, it appears that you're getting UTF-8 strings: if you're not expecting them, you could simply filter out all characters that are not part of the standard ASCII set (even reduced, filtering all control characters away. For example $string = filter_var($input, FILTER_SANITIZE_FULL_SPECIAL_CHARS, FILTER_FLAG_STRIP_LOW).

This is not just a matter of DB performance, but also security!

PS: I hardly doubt that bot is actually Bing. Seems like a bot trying to hack your website.

Addendum: some suggestions about input validation

As I wrote above in some comments (and as others have written too), you should always validate every input. No matter what is that or where it comes from: if it comes from outside, it has to be validated.

The general idea is to validate your input accordingly to what you're expecting. With $input any input variable (anything coming from $_GET, $_POST, $_COOKIE, from external API's and from many $_SERVER variables as well - plus anything more that could be altered by a user, use your judgement and in doubt be overly cautious).

If you're requesting an integer or float number, then it's easy: just cast the input to (int) or (float)

$filtered = (int)$input;
$filtered = (float)$input;

If you're requesting a string, then it's more complicated. You should think about what kind of string you are requesting, and filter it accordingly. For example:

  • If you're expecting a string like a hexadecimal id (like some databases use), then filter all characters outside the 0-9A-Fa-f range: $filtered = preg_replace('/[^0-9A-Fa-f]/', '', $input);
  • If you're expecting an alphanumeric ID, filter it, removing all characters that are not part of that ASCII range. You can use the code posted above: $string = filter_var($input, FILTER_SANITIZE_FULL_SPECIAL_CHARS, FILTER_FLAG_STRIP_LOW);. This one removes all control characters too.
  • If you're expecting your input to be Unicode UTF-8, validate it. For example, see this function: https://stackoverflow.com/a/1523574/192024

In addition to this:

  • Always encode HTML tags. FILTER_SANITIZE_FULL_SPECIAL_CHARS will do that as well on filter_var. If you don't do that, you risk XSS (Cross-Site Scripting) attacks.
  • If you want to remove control characters and encode HTML entities but without removing the newline chracters (\n and \r), then you can use: $filtered = preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/u', '', htmlspecialchars($input, ENT_COMPAT, 'UTF-8'));

And much more. Use your judgement always.

PS: My approach to input filtering is to prefer sanitization. That is, remove everything "dangerous" and accept the sanitized input as if that was what the user wrote. Other persons will instead argue that input should only be accepted or refused.
Personally, I prefer the "sanitize and use" approach for web applications, as your users still may want to see something more than an error web page; on desktop/mobile apps I go with the "accept or refuse" method instead.
However, that's just a matter of personal preference, backed only by what my guts tell me about UX. You're free to follow the approach you prefer.

Community
  • 1
  • 1
ItalyPaleAle
  • 7,185
  • 6
  • 42
  • 69
0

There should be some sort of validation done on any data before it is used in a query. If you have a limit on the length of the name, then you could use that as part of the validation when checking the input. If it's over the limit, it can't be in there and then handle it accordingly. Whether it's a 404 or a page that displays an error message.

The load will go down if you are bypassing queries because a name is too long. Depending on how you are querying the database, LIKE or MATCH AGAINST and how your indexes are set up, will determine just how much load will go down.

slapyo
  • 2,979
  • 1
  • 15
  • 24
  • My question is: doesn't the db client interface do it? E.g. if a column has a max length of 100 and when preparing the statement I put in a 1000 char long string as an input for the column then does the db client truncates the string before sending it to the db? Because if it does anyway then I don't have to do anything. – Tom Oct 09 '14 at 19:58
  • @Tom every query to the DB uses resources. Think about it: communicate with the server, send the query to it, the server parses the query, etc. Again, though, it's not just a matter of performance, but first of all of security. – ItalyPaleAle Oct 09 '14 at 20:00
  • Yes, it will still query the database. The database won't decide that the value you are searching for is longer than the column length. It will run the query and return no results. – slapyo Oct 09 '14 at 20:05
  • Yes, of course, i was thinking of the second query (after preparing) when the statement is actually executed. Then the client in principle can know about column lenghts. – Tom Oct 09 '14 at 20:15