What you're asking is more than legitimate; I'd say it's something you should already be doing as part of your input filtering/validation. If you expect your input to always be shorter than 100 characters, anything longer should be filtered out.
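Also, it appears that you're getting UTF-8 strings: if you're not expecting them, you could simply strip all characters that are not part of the standard ASCII set (or an even smaller set, with the control characters stripped as well). For example:
$string = filter_var($input, FILTER_SANITIZE_SPECIAL_CHARS, FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH);
Here FILTER_FLAG_STRIP_LOW removes characters below 32 and FILTER_FLAG_STRIP_HIGH removes those above 127; the filter also HTML-encodes the special characters. Note that FILTER_SANITIZE_FULL_SPECIAL_CHARS does not honor the strip flags.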
This is not just a matter of DB performance, but also security!
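As a rough illustration of the two points above (a length cap and ASCII-only input), here is a minimal sketch. The function name sanitizeShortAscii and the default limit of 100 are assumptions made for the example, not something taken from your code; adapt both to the field you're handling.

```php
<?php
// Hypothetical helper: the name and the default limit are illustrative only.
function sanitizeShortAscii(string $input, int $maxLength = 100): string
{
    // Keep only printable ASCII (space through tilde). This drops control
    // characters and every byte of any multi-byte UTF-8 sequence.
    $ascii = preg_replace('/[^\x20-\x7E]/', '', $input) ?? '';

    // Cut whatever is still longer than the expected maximum.
    return substr($ascii, 0, $maxLength);
}
```

This variant truncates over-long input; if you'd rather refuse it outright, compare strlen($input) against the limit before doing anything else.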
PS: I highly doubt that bot is actually Bing. It looks like a bot trying to hack your website.
Addendum: some suggestions about input validation
As I wrote above in some comments (and as others have written too), you should always validate every input. No matter what it is or where it comes from: if it comes from outside, it has to be validated.
The general idea is to validate your input according to what you're expecting. In the examples below, $input is any input variable: anything coming from $_GET, $_POST, $_COOKIE, from external APIs and from many $_SERVER variables as well, plus anything else that could be altered by a user (use your judgement and, when in doubt, be overly cautious).
If you're requesting an integer or float number, then it's easy: just cast the input to (int) or (float):
$filtered = (int)$input;
$filtered = (float)$input;
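Keep in mind that a cast silently turns non-numeric input such as "abc" into 0. Where that matters, one option is PHP's validating filter, sketched below. The parsePositiveInt name and the 1 to 1000 range are made up for the example.

```php
<?php
// Hypothetical helper: returns the integer, or null when the input is not
// a whole number inside the allowed range.
function parsePositiveInt(string $input, int $min = 1, int $max = 1000): ?int
{
    $value = filter_var($input, FILTER_VALIDATE_INT, [
        'options' => ['min_range' => $min, 'max_range' => $max],
    ]);

    // filter_var() returns false when validation fails.
    return $value === false ? null : $value;
}
```

The caller can then decide what a null means: fall back to a default, or show an error.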
If you're requesting a string, then it's more complicated. You should think about what kind of string you are requesting, and filter it accordingly. For example:
- If you're expecting a string like a hexadecimal id (like some databases use), then filter all characters outside the 0-9A-Fa-f range:
$filtered = preg_replace('/[^0-9A-Fa-f]/', '', $input);
- If you're expecting an alphanumeric ID, filter it, removing all characters outside the ASCII range:
$string = filter_var($input, FILTER_SANITIZE_SPECIAL_CHARS, FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH);
This one removes all control characters too and HTML-encodes the special characters. For a strict whitelist, the preg_replace pattern from the previous point works just as well with the range 0-9A-Za-z.
- If you're expecting your input to be Unicode UTF-8, validate it. For example, see this function: https://stackoverflow.com/a/1523574/192024
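For the UTF-8 case, a minimal sketch using the mbstring extension (assumed to be installed) could look like this. The function names are hypothetical. mb_check_encoding() only tests the input, while mb_scrub() (PHP 7.2 or later) replaces invalid byte sequences instead of rejecting the whole string.

```php
<?php
// Hypothetical helper: accept-or-refuse check for UTF-8 input.
function isValidUtf8(string $input): bool
{
    return mb_check_encoding($input, 'UTF-8');
}

// Hypothetical helper: sanitizing variant that replaces invalid byte
// sequences with the mbstring substitute character.
function scrubUtf8(string $input): string
{
    return mb_scrub($input, 'UTF-8');
}
```

Which of the two fits better depends on whether you want to refuse malformed input or clean it up and move on.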
In addition to this:
- Always encode HTML tags. FILTER_SANITIZE_FULL_SPECIAL_CHARS will do that as well when used with filter_var. If you don't do that, you risk XSS (Cross-Site Scripting) attacks.
- If you want to remove control characters and encode HTML entities without removing the newline characters (\n and \r), then you can use:
$filtered = preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/u', '', htmlspecialchars($input, ENT_COMPAT, 'UTF-8'));
And much more. Always use your judgement.
PS: My approach to input filtering is to prefer sanitization. That is, remove everything "dangerous" and accept the sanitized input as if that were what the user wrote. Others will instead argue that input should only be accepted or refused.
Personally, I prefer the "sanitize and use" approach for web applications, as your users still may want to see something more than an error web page; on desktop/mobile apps I go with the "accept or refuse" method instead.
However, that's just a matter of personal preference, backed only by what my gut tells me about UX. You're free to follow the approach you prefer.
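To make the difference between the two approaches concrete, here is a small sketch that handles the hexadecimal ID from earlier both ways. Both function names are invented for the illustration.

```php
<?php
// "Sanitize and use": strip what doesn't belong and carry on with the rest.
function sanitizeHexId(string $input): string
{
    return preg_replace('/[^0-9A-Fa-f]/', '', $input) ?? '';
}

// "Accept or refuse": return the input untouched if it is entirely valid,
// otherwise null so the caller can report an error.
function validateHexId(string $input): ?string
{
    // The D modifier stops $ from matching before a trailing newline.
    return preg_match('/^[0-9A-Fa-f]+$/D', $input) === 1 ? $input : null;
}
```

Given "12AB-zz!", the first function carries on with "12AB", while the second returns null and leaves the decision to the caller.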