
I am sanitizing a contact form string:

$note = filter_var($_POST["note"], FILTER_SANITIZE_STRING);

Which works great except when people write in inches (") and feet ('). So `I'm interested in 5" 8" 10" & 1'` comes up as `I&#39;m interested in 5&#34; 8&#34; 10&#34; & 1&#39;`, which is a bit of a garbled mess.

Can I sanitize yet keep my I'm 5'9"?

Jon
  • For what purpose are you sanitising to begin with...? – deceze Mar 07 '16 at 15:01
  • `&#39;` and `&#34;` are identical to `'` and `"` respectively... in an HTML text context. If you're using this to save to a database or email as plain-text, you're doing it very wrong. – Niet the Dark Absol Mar 07 '16 at 15:04
  • [What does FILTER_SANITIZE_STRING do?](http://stackoverflow.com/questions/23392128/what-does-filter-sanitize-string-do) - Spoiler: nothing that actually makes sense or is useful. – Álvaro González Mar 07 '16 at 15:06
  • You need to sanitize for the target usage environment, and if the string filter is removing characters you need, then you shouldn't be using that method in the first place. – Marc B Mar 07 '16 at 15:08
  • Keeping in mind what Marc B pointed out, you could try applying the `FILTER_FLAG_NO_ENCODE_QUOTES` flag. Here are the [docs](http://php.net/manual/en/filter.filters.sanitize.php) – mrun Mar 07 '16 at 15:26
  • @mrun I think we're all more hinting in the direction of **simply not sanitising at all.** – deceze Mar 07 '16 at 15:30
  • @deceze With all due respect, you're doing so without having your first question answered yet. – mrun Mar 07 '16 at 15:35
  • @mrun That was a rhetorical question really. "Sanitisation" as a concept is icky at best, and there's no such thing as a one-size-fits-all sanitisation function. In 99% of cases, sanitisation is nonsense. I'll go out on a limb and predict that if OP knew what they're sanitising for, they wouldn't have posted this question because they'd know what the sanitisation does and possibly how to work around its undesired side effects. – deceze Mar 07 '16 at 15:39
  • I am sanitizing to remove data that is potentially harmful for my application. – Jon Mar 07 '16 at 17:03
  • That is a very vague statement, and it's conflicting with your actual needs. You *want* single quote characters in your data; clearly removing or garbling them is *not* in your interest. You'll probably have to secure your application using the traditional escape-as-necessary approach instead of a (useless) global sanitization approach. See http://kunststube.net/escapism. – deceze Mar 07 '16 at 18:36
  • How can data be harmful in general terms? That doesn't make any sense. Data can only be harmful in specific contexts. Letter "a" is harmless regarding SQL injection or XSS but a stone letter falling from a building can send you to hospital. – Álvaro González Mar 08 '16 at 13:30

1 Answer


Computer data itself is neither harmful nor innocuous. It's just a piece of information that can later be used for a given purpose.

Sometimes, data is used as computer source code and such code eventually leads to physical actions (a disk spins, an LED blinks, a picture is uploaded to a remote computer, a thermostat turns off the boiler...). It's then (and only then) that data can become harmful; we even lose expensive spaceships now and then because of software bugs.

Code you write yourself can be as harmful or innocuous as your abilities or good faith dictate. The big problem comes when your application has a vulnerability that allows execution of untrusted third-party code. This is particularly serious in web applications, which are connected to the open internet and are expected to receive data from anywhere in the world. But how is that physically possible? There are several ways, but the most typical case is dynamically generated code, and this happens all the time on the modern web. You use PHP to generate SQL, HTML, JavaScript... If you pick untrusted arbitrary data (e.g. a URL parameter or a form field) and use it to compose code that will later be executed (either by your server or by the visitor's browser), someone can be hacked (either you or your users).

You'll see that every day here at Stack Overflow:

$username = $_POST["username"];
$row = mysql_fetch_array(mysql_query("select * from users where username='$username'"));
<td><?php echo $row["title"]; ?></td>
var id = "<?php echo $_GET["id"]; ?>";

Faced with this problem, some claim: let's sanitize! It's obvious that some characters are evil so we'll remove them all and we're done, right? And then we see stuff like this:

$username = $_POST["username"];
$username = strip_tags($username);
$username = htmlentities($username);
$username = stripslashes($username);
$row = mysql_fetch_array(mysql_query("select * from users where username='$username'"));

This is a surprisingly widespread misconception, adopted even by some professionals. You see the symptoms everywhere: your comment is mutilated at the first < symbol, you get "your password cannot contain spaces" on sign-up and you read "Why can’t I use certain words like "drop" as part of my Security Question answers?" in the FAQ. It's even inside computer languages: whenever you read "sanitize", "escape"... in a function name (without further context), you have a good hint that it might be a misguided effort.
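To make the symptom concrete, here is a minimal sketch using the measurement from the question. The variable names are hypothetical and `htmlspecialchars()` merely stands in for whatever blanket "sanitizer" is applied at input time; the point is only that escaping for HTML before you know where the text will end up damages it for every other use.

```php
<?php
// Hypothetical form value, exactly as a visitor would type it.
$note = 'I\'m 5\'9"';

// Blanket "sanitizing" on input: the HTML-escaped text is what gets stored.
$stored = htmlspecialchars($note, ENT_QUOTES, 'UTF-8');

// In a plain-text context (e.g. an email body) the entities leak through:
// I&#039;m 5&#039;9&quot;
echo $stored, "\n";

// A template that (correctly) escapes on output now escapes it a second time:
// <p>I&amp;#039;m 5&amp;#039;9&amp;quot;</p>
$html = '<p>' . htmlspecialchars($stored, ENT_QUOTES, 'UTF-8') . '</p>';
echo $html, "\n";

// Keeping the raw text and escaping once, at the HTML boundary, displays fine:
// <p>I&#039;m 5&#039;9&quot;</p>
$safe = '<p>' . htmlspecialchars($note, ENT_QUOTES, 'UTF-8') . '</p>';
echo $safe, "\n";
```

The browser renders the last line as the original `I'm 5'9"`, while the raw value stays intact for email, logs or the database.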

It's all about establishing a clear separation between data and code: the user provides data but only you provide code. And there isn't a universal one-size-fits-all solution because each computer language has its own syntax and rules. `DROP TABLE users;` can be terribly dangerous in SQL:

mysql> DROP TABLE users;
Query OK, 56020 rows affected (0.52 sec)

(oops!)... but it's not as bad in e.g. JavaScript. Look, it doesn't even run:

C:\>node
> DROP TABLE users;
SyntaxError: Unexpected identifier
    at Object.exports.createScript (vm.js:24:10)
    at REPLServer.defaultEval (repl.js:235:25)
    at bound (domain.js:287:14)
    at REPLServer.runBound [as eval] (domain.js:300:12)
    at REPLServer.<anonymous> (repl.js:427:12)
    at emitOne (events.js:95:20)
    at REPLServer.emit (events.js:182:7)
    at REPLServer.Interface._onLine (readline.js:211:10)
    at REPLServer.Interface._line (readline.js:550:8)
    at REPLServer.Interface._ttyWrite (readline.js:827:14)
>

This last example also illustrates that it's not only a security concern. Even if you're not being hacked, generating code from random input can simply make your app crash:

SELECT * FROM customers WHERE last_name='O'Brian';

You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'Brian''

So, what shall be done then if there isn't a universal solution?

  1. Understand the problem:

    If you inject raw literal data improperly it can become code (and sometimes invalid code).

  2. Use the specific mechanism for each technology:

    If the target language requires escaping:

    <p><3 to code</p>
    <p>&lt;3 to code</p>

    ... find a specific tool to escape in the source language:

    echo '<p>' . htmlspecialchars($motto) . '</p>';
    

    If the language/framework/technology allows sending data in a separate channel, do it:

     $sql = 'SELECT password_hash FROM user WHERE username=:username';
     $params = array(
         'username' => $username,
     );
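
The two mechanisms above can be put together in a short sketch. It assumes PDO with the `pdo_sqlite` extension and uses an in-memory SQLite database as a stand-in for the real one; the table layout, the sample username and the sample password are hypothetical. The value is stored untouched through bound parameters and is escaped only at the point where it enters HTML.

```php
<?php
// Assumption: an in-memory SQLite database stands in for the real server.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE user (username TEXT PRIMARY KEY, password_hash TEXT)');

// Hypothetical untrusted input containing a quote character.
$username = "O'Brian";

// The data travels as a bound parameter, so the quote never becomes SQL syntax.
$insert = $pdo->prepare(
    'INSERT INTO user (username, password_hash) VALUES (:username, :password_hash)'
);
$insert->execute(array(
    'username'      => $username,
    'password_hash' => password_hash('example-password', PASSWORD_DEFAULT),
));

// Same query and parameter array as in the snippet above, now actually executed.
$sql = 'SELECT password_hash FROM user WHERE username=:username';
$params = array(
    'username' => $username,
);
$select = $pdo->prepare($sql);
$select->execute($params);
$hash = $select->fetchColumn();

// Escape only when the value enters HTML, with the tool made for HTML.
$cell = '<td>' . htmlspecialchars($username, ENT_QUOTES, 'UTF-8') . '</td>';
echo $cell, "\n";
```

Nothing is removed from the input at any point, which is why a note such as `I'm 5'9"` would survive the round trip unchanged.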
    
Álvaro González
  • I find this blog a good overview to the problem http://kunststube.net/escapism/ - "The Great Escapism (Or: What You Need To Know To Work With Text Within Text)" – aland Jun 14 '17 at 21:16