7

Are there any security exploits that could occur in this scenario:

eval(repr(unsanitized_user_input), {"__builtins__": None}, {"True":True, "False":False})

where unsanitized_user_input is a str object. The string is user-generated and could be nasty. Assuming our web framework hasn't failed us, it's a real honest-to-god str instance from the Python builtins.

If this is dangerous, can we do anything to the input to make it safe?

We definitely don't want to execute anything contained in the string.

See also:

The larger context which is (I believe) not essential to the question is that we have thousands of these:

repr([unsanitized_user_input_1,
      unsanitized_user_input_2,
      unsanitized_user_input_3,
      unsanitized_user_input_4,
      ...])

in some cases nested:

repr([[unsanitized_user_input_1,
       unsanitized_user_input_2],
      [unsanitized_user_input_3,
       unsanitized_user_input_4],
       ...])

which are themselves converted to strings with repr(), put in persistent storage, and eventually read back into memory with eval.

Eval deserialized the strings from persistent storage much faster than pickle and simplejson. The interpreter is Python 2.5 so json and ast aren't available. No C modules are allowed and cPickle is not allowed.

Community
  • 1
  • 1
gravitation
  • 1,939
  • 2
  • 21
  • 26
  • "The reason for doing this would make much more sense if I presented the larger context" Could you please elaborate on the problem? Currently the command seems utterly pointless - the same as doing nothing with `unsanitized_user_input`.. – dbr Jul 11 '09 at 02:13
  • "we have thousands of these" That makes no sense. Why would you store input that way? There's no point in repr()ing a string for storage purposes. – Miles Jul 11 '09 at 04:58
  • Why aren't you using pickle or something simpler? – S.Lott Jul 11 '09 at 11:11
  • pickle is more than an order of magnitude slower and not necessarily more secure, if the docs are to be believed – gravitation Jul 11 '09 at 14:41
  • I too am curious why the data is being repr-ed in the first place. – Nick Johnson Jul 11 '09 at 17:49
  • the data is repr'd to change the nested list structure into a string so it can be compressed and stored in a blob property – gravitation Jul 12 '09 at 18:08
  • To ask the obvious question, is serialization/de-serialization time really a bottleneck, or are you engaging in premature optimization? – Nick Johnson Jul 13 '09 at 09:32
  • Good observation Nick, it is _not_ a bottleneck with the current number of users of the app and in fact the current actual cpu usage is very low, however after some load problems with my previous app I have tested with 1000 users and unfortunately the serialization does become quite the bottleneck at that point. However there may never be that many real users, it is always a guessing game – gravitation Jul 15 '09 at 02:29

5 Answers5

19

It is indeed dangerous and the safest alternative is ast.literal_eval (see the ast module in the standard library). You can of course build and alter an ast to provide e.g. evaluation of variables and the like before you eval the resulting AST (when it's down to literals).

The possible exploit of eval starts with any object it can get its hands on (say True here) and going via .__class_ to its type object, etc. up to object, then gets its subclasses... basically it can get to ANY object type and wreck havoc. I can be more specific but I'd rather not do it in a public forum (the exploit is well known, but considering how many people still ignore it, revealing it to wannabe script kiddies could make things worse... just avoid eval on unsanitized user input and live happily ever after!-).

Alex Martelli
  • 854,459
  • 170
  • 1,222
  • 1,395
8

If you can prove beyond doubt that unsanitized_user_input is a str instance from the Python built-ins with nothing tampered, then this is always safe. In fact, it'll be safe even without all those extra arguments since eval(repr(astr)) = astr for all such string objects. You put in a string, you get back out a string. All you did was escape and unescape it.

This all leads me to think that eval(repr(x)) isn't what you want--no code will ever be executed unless someone gives you an unsanitized_user_input object that looks like a string but isn't, but that's a different question--unless you're trying to copy a string instance in the slowest way possible :D.

hao
  • 10,138
  • 1
  • 35
  • 50
  • 1
    That's exactly right; I definitely don't want anything in the string to be executed. The reason for doing this would make much more sense if I presented the larger context but I tried to simplify the scenario for the question. – gravitation Jul 11 '09 at 01:58
5

With everything as you describe, it is technically safe to eval repred strings, however, I'd avoid doing it anyway as it's asking for trouble:

  • There could be some weird corner-case where your assumption that only repred strings are stored (eg. a bug / different pathway into the storage that doesn't repr instantly becmes a code injection exploit where it might otherwise be unexploitable)

  • Even if everything is OK now, assumptions might change at some point, and unsanitised data may get stored in that field by someone unaware of the eval code.

  • Your code may get reused (or worse, copy+pasted) into a situation you didn't consider.

As Alex Martelli pointed out, in python2.6 and higher, there is ast.literal_eval which will safely handle both strings and other simple datatypes like tuples. This is probably the safest and most complete solution.

Another possibility however is to use the string-escape codec. This is much faster than eval (about 10 times according to timeit), available in earlier versions than literal_eval, and should do what you want:

>>> s = 'he\nllo\' wo"rld\0\x03\r\n\tabc'
>>> repr(s)[1:-1].decode('string-escape') == s
True

(The [1:-1] is to strip the outer quotes repr adds.)

Community
  • 1
  • 1
Brian
  • 116,865
  • 28
  • 107
  • 112
3

Generally, you should never allow anyone to post code.

So called "paid professional programmers" have a hard-enough time writing code that actually works.

Accepting code from the anonymous public -- without benefit of formal QA -- is the worst of all possible scenarios.

Professional programmers -- without good, solid formal QA -- will make a hash of almost any web site. Indeed, I'm reverse engineering some unbelievably bad code from paid professionals.

The idea of allowing a non-professional -- unencumbered by QA -- to post code is truly terrifying.

S.Lott
  • 384,516
  • 81
  • 508
  • 779
1
repr([unsanitized_user_input_1,
      unsanitized_user_input_2,
      ...

... unsanitized_user_input is a str object

You shouldn't have to serialise strings to store them in a database..

If these are all strings, as you mentioned - why can't you just store the strings in a db.StringListProperty?

The nested entries might be a bit more complicated, but why is this the case? When you have to resort to eval to get data from the database, you're probably doing something wrong..

Couldn't you store each unsanitized_user_input_x as it's own db.StringProperty row, and have group them by an reference field?

Either of those may not be applicable, since I've no idea what you're trying to achieve, but my point is - can you not structure the data in a way you where don't have to rely on eval (and also rely on it not being a security issue)?

dbr
  • 165,801
  • 69
  • 278
  • 343
  • One reason is that the strings need to be compressed to fit in App Engine's 1MB entity limit, and I thought the overhead of compressing 1000s of strings individually would probably be much higher than serializing them, compressing them all together, and putting them into a blob. The space savings would probably be less too. But it's a good point... I'm definitely looking for ways to avoid eval without increasing running cost too much. – gravitation Jul 11 '09 at 17:00