7

Javascript's version of f-strings allows for string escaping through use of a somewhat funny API, e.g.

function escape(str) {
    var div = document.createElement('div');
    div.appendChild(document.createTextNode(str));
    return div.innerHTML;
}
function escapes(template, ...expressions) {
  return template.reduce((accumulator, part, i) => {
    return accumulator + escape(expressions[i - 1]) + part
  })
}

var name = "Bobby <img src=x onerr=alert(1)></img> Arson"
element.innerHTML = escapes`Hi, ${name}` # "Hi, Bobby &lt;img src=x onerr=alert(1)&gt;&lt;/img&gt; Arson"

Does Python f-strings allow for a similar mechanism? or do you need to bring your own string.Formatter? Would a more pythonic implementation wrap results into a class with an overriden __str__() method before interpolation?

badp
  • 11,409
  • 3
  • 61
  • 89
  • This question brought to you by the YYYY-MM-DD Dates Will Be Parsed As YYYY Minus MM Minus DD If You Don't Quote Them consortium – badp Jun 11 '19 at 12:48
  • Hi, I think that could help you. Identify mu clearly the operation of the f-string in python https://cito.github.io/blog/f-strings/ – Delari Jesus Jun 14 '19 at 13:50

2 Answers2

10

When you're dealing with text that is going to be interpreted as code (e.g., text that the browser will parse as HTML or text that a database executes as SQL), you don't want to solve security issues by implementing your own escaping mechanism. You want to use the standard, widely tested tools to prevent them. This gives you much greater safety from attacks for several reasons:

  • The wide adoption means the tools are well tested and much less likely to contain bugs.
  • You know they have the best available approach to solving the problem.
  • They will help you avoid the common mistakes associated with generating the strings yourself.

HTML escaping

The standard tools for HTML escaping are templating engines, such as Jinja. The major advantage is that these are designed to escape text by default, rather than requiring you to remember to explicitly convert unsafe strings. (You do need to be cautious about bypassing or disabling, even temporarily, the escaping, though. I have seen my share of insecure attempts to insecurely construct JSON in templates, but the risk in templates is still lower than a system that requires explicit escaping everywhere.) Your example is pretty easy to implement with Jinja:

import jinja2

template_str = 'Hi, {{name}}'
name = "Bobby <img src=x onerr=alert(1)></img> Arson"

jinjaenv = jinja2.Environment(autoescape=jinja2.select_autoescape(['html', 'xml']))
template = jinjaenv.from_string(template_str)

print(template.render(name=name))
# Hi, Bobby &lt;img src=x onerr=alert(1)&gt;&lt;/img&gt; Arson

If you're generating HTML, though, chances are you're using a web framework such as Flask or Django. These frameworks include a templating engine and will require less set up than the above example.

MarkupSafe is a useful tool if you're trying to create your own template engine (Some Python templating engines use it internally, such as Jinja.), and you could potentially integrate it with a Formatter. But there's no reason to reinvent the wheel. Using a popular engine will result in much simpler, easier to follow, more recognizable code.

SQL injection

SQL injection is not solved through escaping. PHP has a nasty history that everyone has learned from. The lesson is use parameterized queries instead of trying to escape input. This prevents untrusted user data from ever being parsed as SQL code.

How you do this depends on exactly what libraries you're using for executing your queries, but for an example, doing so with SQLAlchemy's execute method looks like this:

session.execute(text('SELECT * FROM thing WHERE id = :thingid'), thingid=id)

Note that SQLAlchemy is not just escaping the text of id to ensure it does not contain attack code. It is actually differentiating between the SQL and the value for the database server. The database will parse the query text as a query, and then it will include the value separately after the query has been parsed. This makes it impossible for the value of id to trigger unintended side effects.

Note also that quoting issues are precluded by parameterized queries:

name = 'blah blah blah'
session.execute(text('SELECT * FROM thing WHERE name = :thingname'), thingname=name)

If you can't parameterize, whitelist in memory

Sometimes, it's not possible to parameterize something. Maybe you're trying to dynamically select a table name based on the input. In these cases, one thing you can do is have a collection of known valid and safe values. By validating that the input is one of these values and retrieving a known safe representation of it, you avoid sending user input into your query:

# This could also be loaded dynamically if needed.
valid_tables = {
    # Keys are uppercased for look up
    'TABLE1' : 'table1',
    'TABLE2': 'Table2',
    'TABLE3': 'TaBlE3',
    ...
}

def get_table_name(table_num):
    table_name = 'TABLE' + table_num
    try:
        return valid_tables[table_name]
    except KeyError:
        raise 'Unknown table number: ' + table_num


def query_for_thing(session, table_num):
    return session.execute(text('SELECT * FROM "{}"'.format(get_table_name(table_num))

The point is you never want to allow user input to go into your query as something other than a parameter.

Make sure that this whitelisting occurs in application memory. Do not perform the whitelisting in SQL itself. Whitelisting in the SQL is too late; by that time, the input has already been parsed as SQL, which would allow the attacks to be invoked before the whitelisting could take effect.

Make sure you understand your library

In the comments, you mentioned PySpark. Are you sure you're doing this right? If you create a data frame using just a simpler SELECT * FROM thing and then use PySpark filtering functions, are you sure it doesn't properly push those filters down to the query, precluding the need to format values into it unparameterized?

Make sure you understand how data is normally filtered and manipulated with your library, and check if that mechanism will use parameterized queries or otherwise be efficient enough under the hood.

With small data, just filter in memory

If your data isn't at least in the tens of thousands of records, then consider just loading it into memory and then filtering:

filter_name = 'blah blah blah'
results = session.execute(text('SELECT * FROM thing'))
filtered_results = [r for r in results if r.name == filter_name]

If this is fast enough and parameterizing the query is hard, then this approach avoids all the security headaches of trying to make the input safe. Test its performance with somewhat more data than you expect to see in prod. I would use at least double of the maximum you expect; an order of magnitude would be even safer if you can make it perform.

If you're stuck without parameterized query support, the last resort is very strict limits on inputs

If you're stuck with a client that doesn't support parameterized queries, first check if you can use a better client. SQL without parameterized queries is absurd, and it's an indication that the client you're using is very low quality and probably not well maintained; it may not even be widely used.

Doing the following is NOT recommended. I include it only as an absolute last resort. Don't do this if you have any other choice, and spend as much time as you can (even a couple of weeks of research, I dare say) trying to avoid resorting to this. It requires a very high level of diligence on the part of every team member involved, and most developers do not have that level of diligence.

If none of the above is a possibility, then the following approach may be all you can do:

Do not query on text strings coming from the user. There is no way to make this safe. No amount of quoting, escaping, or restricting is guaranteed. I don't know all the details, but I've read of the existence of Unicode abuses that can allow bypassing character restrictions and the like. It's just not worth it to try. The only text strings allowed should be whitelisted in application memory (as opposed to whitelisted via some SQL or database function). Note that even leveraging database level quoting functions (like PostgreSQL's quote_literal) or stored procedures can't help you here because the text has to be parsed as SQL to even reach those functions, which would allow the attacks to be invoked before the whitelisting could take effect.

For all other data types, parse them first and then have the language render them into an appropriate string. Doing so again means avoiding having user input parsed as SQL. This requires you to know the data type of the input, but that's reasonable since you'll need to know that to construct the query. In particular, the available operations with a particular column will be determined by that column's data types, and the operation and column type will determine what data types are valid for the input.

Here's an example for a date:

from datetime import datetime

def fetch_data(start_date, end_date):
    # Check data types to prevent injections
    if not isinstance(start_date, datetime):
        raise ValueError('start_date must be a datetime')
    if not isinstance(end_date, datetime):
        raise ValueError('end_date must be a datetime')

    # WARNING: Using format with SQL queries is bad practice, but we don't
    # have a choice because [client lib] doesn't support parameterized queries.
    # To mitigate this risk, we do not allow arbitrary strings as input.
    # We tightly control the input's data type (to something other than text or binary) and the format used in the query.
    session.execute(text(
        "SELECT * FROM thing WHERE timestamp BETWEEN CAST('{start}' AS TIMESTAMP) AND CAST('{end}' AS TIMESTAMP)"
        .format(
            # Make the format used explicit
            start=start_date.strftime('%Y-%m-%dT%H:%MZ'),
            end=end_date.strftime('%Y-%m-%dT%H:%MZ')
        )
    ))

user_input_start_date = '2019-05-01T00:00'
user_input_end_date = '2019-06-01T00:00'

parsed_start_date = datetime.strptime(user_input_start_date, "%Y-%m-%dT%H:%M")
parsed_end_date = datetime.strptime(user_input_end_date, "%Y-%m-%dT%H:%M")


data = fetch_data(parsed_start_date, parsed_end_date)

There's several details that you need to be aware of.

  1. Notice that in the same function as the query, we're validating the data type. This is one of the rare exceptions in Python where you don't want to trust duck typing. This is a safety feature that ensures insecure data won't be passed into your function accidentally.
  2. The format passed of the input when it's rendered into the SQL string is explicit. Again, this is about control and whitelisting. Don't leave it to any other library to decide what format the input will be rendered to; make sure you know exactly what the format is so that you can be certain that injections are impossible. I'm fairly certain that there's no injection possibility with the ISO 8601 date/time format, but I haven't confirmed that explicitly. You should confirm that.
  3. The quoting of the values is manual. That's okay. And the reason it's okay is because you know what data types you're dealing with and you know exactly what the string will look like after it's formatted. This is by design: you're maintaining very strict, very tight control over the input's format to prevent injections. You know whether quotes need to be added or not based on that format.
  4. Don't skip the comment about how bad this practice is. You have no idea who will read this code later and what knowledge or abilities they have. Competent developers who understand the security risks here will appreciate the warning; developers who weren't aware will be warned to use parameterized queries whenever available and to avoid carelessly including new conditions. If at all feasible, require that changes to these areas of code be reviewed by additional developers to further mitigate the risks.
  5. This function should have full control over generating the query. It should not delegate its construction out to other functions. This is because the data type checking needs to be kept very, very close to the construction of the query to avoid mistakes.

The effect of this is a sort of looser whitelisting technique. You can't whitelist specific values, but you can whitelist the kinds of values you're working with and control the format they're delivered in. Forcing callers to parse the values into a known data type reduces the possibility of an attack getting through.

I'll also note that callering code is free to accept the user input in whatever format is convenient and to parse it using whatever tools you wish. That's one of the advantages of requiring a dedicated data type instead of strings for input: you don't lock callers into a particular string format, just the data type. For date/times in particular, you might consider some third party libraries.

Here's another example with a Decimal value instead:

from decimal import Decimal

def fetch_data(min_value, max_value):
    # Check data types to prevent injections
    if not isinstance(min_value, Decimal):
        raise ValueError('min_value must be a Decimal')
    if not isinstance(max_value, Decimal):
        raise ValueError('max_value must be a Decimal')

    # WARNING: Using format with SQL queries is bad practice, but we don't
    # have a choice because [client lib] doesn't support parameterized queries.
    # To mitigate this risk, we do not allow arbitrary strings as input.
    # We tightly control the input's data type (to something other than text or binary) and the format used in the query.
    session.execute(text(
        "SELECT * FROM thing WHERE thing_value BETWEEN CAST('{minv}' AS NUMERIC(26, 16)) AND CAST('{maxv}' AS NUMERIC(26, 16))"
        .format(
            # Make the format used explicit
            # Up to 16 decimal places. Maybe validate that at start of function?
            minv='{:.16f}'.format(min_value),
            maxv='{:.16f}'.format(max_value)
        )
    ))

user_input_min = '78.887'
user_input_max = '89789.78878989'

parsed_min = Decimal(user_input_min)
parsed_max = Decimal(user_input_max)

data = fetch_data(parsed_min, parsed_max)

Everything is basically the same. Just a slightly different data type and format. You're free to use whatever data types your database supports, of course. For example, if your DB does not require specifying a scale and precision on the numeric type or would auto-cast a string or can handle the value unquoted, you can structure your query accordingly.

jpmc26
  • 28,463
  • 14
  • 94
  • 146
  • 1
    In the event there's any confusion on the matter, I chose to community wiki this because I am uncertain about the quality of the question; it may be Too Broad. As a result, I wished to forego any reputation gains. However, as it cannot be closed due to the bounty and I wish to discourage insecure coding practices, I felt compelled to post an answer anyway. – jpmc26 Jun 15 '19 at 22:06
  • If it helps with limiting the breadth of my question, I wanted to know if there was some mechanism that was or could be built into f-strings, e.g. something ilke `f"select * from Export_{yyyymm} where purchase_date between {since|quote} and {until|quote}"` (as opposed to `between "{since}" and "{until}"`). Though my question is more generic than that, in the case that triggered my question, I'm dealing with bigdata HQL queries that do not support parametrized queries :) – badp Jun 17 '19 at 09:02
  • @badp By HQL, you mean Hive QL? What client library are you using? – jpmc26 Jun 17 '19 at 15:57
  • @badp I've added a suggestion that might help some of your use cases. – jpmc26 Jun 20 '19 at 19:02
  • 1
    @badp For future questions, restrictions that prevent you from using the standard approach to your problem need to be included in the question. Answerers can't know that you've even thought of them if you don't say so, and providing the reason allows answerers to correct you if there's a misconception or to reference a standard approach for that particular situation when one exists. With two answers here now, it is probably too late to revise this question, as invalidating answers is discouraged. – jpmc26 Jun 21 '19 at 14:16
  • 1
    @badp I've updated with two more sections. One is to check on how PySpark's filtering functions work under the hood. They might issue parameterized queries or do something else to ensure the filtering is efficient enough. The other is a long diatribe about trying to secure non-text inputs. I *do not* recommend doing it, and I hope the length and difficulty and annoyance associated with the section is dissuading. – jpmc26 Jun 21 '19 at 17:02
  • I appreciate all of the work you've put into this answer, and it can be disappointing to see half a bounty awarded for all your work. The security angle behind Just Don't Do That is important to keep in mind but occasionally not super relevant; I didn't even mention SQL in my question, just as a tag. The core part of the question for me however really was, can I customize how interpolation works in f"" strings? That has gone unanswered, perhaps because it just _can't_ be customized? That would have been a fine answer. This is why you got an upvote from me, but not the full bounty :) – badp Jun 22 '19 at 18:18
  • (I hit the comment lenght limit but I still wanted to acknoweldge that, yes, while I still don't know if it is possible to customize how f"" strings interpolate, you DID mention Jinja as a viable alternative for string interpolation that can be adjusted to fit use cases.) – badp Jun 22 '19 at 18:24
  • @badp The core question has gone unanswered because implementing your own security system is almost always ill advised. For HTML, using standard tools is the best way to go. For SQL, it's just a bad idea from the get go because it doesn't work. You already know about `Formatter`, so there's nothing I could really add to building your own escaping mechanism besides my mention of MarkupSafe. I *always* focus on the intended use when answering a question more than I do the bare question itself because I typically find that there's better approaches than what the asker wanted. – jpmc26 Jun 22 '19 at 18:48
  • @badp As for the bounty, I'm not worried about it. As I mentioned before, I didn't post this for reputation. My goal here is to dissuade you from doing what you asked about because in the contexts you're planning to use it, it's not a good idea. – jpmc26 Jun 22 '19 at 18:50
  • The author of the question did not ask about SQL escaping. HTML escaping is clearly much less unreliable and much less attack-prone. Rather, the question was about the possibility to extend f-strings behavior. – Dmitriy Sintsov Jan 15 '21 at 12:45
  • @DmitriySintsov Review the comments on this answer, and be aware that HTML escaping is not less attack prone as attackers can inject JS. Then delete your comment as incorrect and irrelevant. – jpmc26 Jan 16 '21 at 13:29
  • I use Jinja2 for html escaping on daily basis. In many years. It uses autoescape=True by default in Flask / Django. – Dmitriy Sintsov Jan 17 '21 at 20:54
  • I know that real reliable autoescaping templates at the back end (server-side) are provided by XSL / XSLT. But it did not become widely adopted. Still, Django / Flask rarely has html escaping vulnerabilities. Clearly less than PHP SQL escaping has. – Dmitriy Sintsov Jan 17 '21 at 20:59
  • @DmitriySintsov You are not making any sense. The author is not using a templating engine, and I specifically instruct them to do so and specifically mentioned the points about it in your comment. This does not mean HTML is somehow inherently safe against injection. It appears you have read neither the question nor my answer in any detail. Do so before leaving comments on people's answers. – jpmc26 Jan 19 '21 at 14:19
2

You do not need to bring your own formatter if you're using python 3.6 or newer. Python 3.6 introduced formatted string literals, see PEP 498: Formatted string literals.

Your example in python 3.6 or newer would look like this:

name = "Bobby <img src=x onerr=alert(1)></img> Arson"
print(f"Hi, {name}")  # Hi, Bobby <img src=x onerr=alert(1)></img> Arson

The format specification that can be used with str.format() can also be used with formatted string literals.

This example,

my_dict = {'A': 21.3, 'B': 242.12, 'C': 3200.53}

for key, value in my_dict.items():
    print(f"{key}{value:.>15.2f}")

will print the following:

A..........21.30
B.........242.12
C........3200.53

Additionally, since the string is evaluated at runtime, any valid python expression can be used, for example,

name = "Abby"
print(f"Hello, {name.upper()}!")

will print

Hello, ABBY!
  • 6
    The author is trying to prevent XSS attacks via HTML injection of user input. Your example does not do that. It leaves the attack unescaped. – jpmc26 Jun 15 '19 at 16:05
  • 1
    The asker is already aware of formatted string literals, and refers to them as "f-strings" in the question title, which is an alias also used in the documentation you cite. This question is about escaping fields in a formatted string literal, so your answer does not really address it. – Aaron Bentley Jun 21 '19 at 17:37