15

Say I have a running CPython session,

Is there a way to run the data (bytes) from a pyc file directly? (without having the data on-disk necessarily, and without having to write a temporary pyc file)

Example script to show a simple use-case:

if foo:
    # Intentionally ambiguous, since the data source
    # is a detail and answers shouldn't depend on this detail.
    data = read_data_from_somewhere()
else:
    data = open("bar.pyc", 'rb').read()

assert isinstance(data, bytes)

code = bytes_to_code(data)

# call a method from the loaded code
code.call_function()

The exact use isn't important, but generating code dynamically and copying it over a network for execution is one use case (for the purpose of thinking about this question).


Here are some example use-cases, which made me curious to know how this can be done:

  • Checking Python scripts for malicious code.
    If a single command can access a larger body of code hidden in binary data, what would that command look like?
  • Dynamically generating code and caching it for re-use (not necessarily on disk; it could be stored in a database, for example).
  • Sending pre-compiled byte-code to a process, e.g. to control an application which embeds Python.
ideasman42
  • Why send bytecode at all instead of source? You have to byte-compile it somewhere, and if you do it on the transmitting side, debugging the far side will be a bitch. Also, a quick look at my python experiment directory shows the .pyc files to be slightly longer than their corresponding .py files. Are you out-clevering yourself, are you prematurely optimizing? – msw Jul 19 '15 at 20:37
  • 1
    @msw, Whether it's sensible is not the point; of course that depends on the situation. One reason I asked the question is that I wanted to know how easy it would be for a bad actor to embed/hide malicious code in a less obvious (non-human-readable) way. Of course one could just compress the code too, but smaller byte-code may be harder to detect. When auditing for malicious Python code, it's useful to know what options are available and what malicious commands may look like. Further, there would be some legitimate uses for this too (where the overhead of compiling on the client is noticeable). – ideasman42 Jul 19 '15 at 22:50
  • Exact use *was* important because we would have closed this question as too broad. Hiding your actual question was disingenuous at best. – msw Jul 19 '15 at 23:13
  • 1
    @msw, I don't see how this is too broad. Asking what is possible is valid, since I'm curious about the CPython runtime. There are multiple reasons I wanted to know; I just listed one of them above. – ideasman42 Jul 19 '15 at 23:16
  • "You should only ask practical, answerable questions based on actual problems that you face. Chatty, open-ended questions diminish the usefulness of our site and push other questions off the front page." http://stackoverflow.com/help/dont-ask – msw Jul 19 '15 at 23:20
  • 2
    The question is practical and the answer is straightforward. Use `marshal.loads` and skip the `pyc` header (12 bytes in Python 3.3 to 3.6, 16 bytes since 3.7). This is not vague or open-ended. – ideasman42 Jul 19 '15 at 23:22

2 Answers

18

Is there a way to run the data from a pyc file directly?

A compiled code object can be serialized to bytes using marshal:

import marshal
# eggs is a code object, e.g. the result of compile()
data = marshal.dumps(eggs)

The bytes can be converted back into a code object and executed:

eggs = marshal.loads(data)
exec(eggs)

A pyc file is a marshaled code object preceded by a header.

For Python 3.7 and later, the header is 16 bytes (12 bytes in Python 3.3 to 3.6). Skip it, and the remaining data can be read via marshal.loads.
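As a rough illustration of the steps above, here is a minimal sketch that turns pyc bytes into a module object without touching the disk. It assumes the Python 3.7+ layout with a 16-byte header; `module_from_pyc_bytes`, the module name and the in-memory test data are made up for this example and are not part of any library.

```python
import importlib.util
import marshal
import types

HEADER_SIZE = 16  # assumption: Python 3.7+ pyc layout


def module_from_pyc_bytes(data, name="loaded_from_bytes"):
    """Build a module object from the raw bytes of a .pyc file (hypothetical helper)."""
    # The first four bytes identify the bytecode version of the compiling interpreter.
    if data[:4] != importlib.util.MAGIC_NUMBER:
        raise ValueError("pyc data was compiled by a different Python version")
    # Everything after the header is a marshaled code object.
    code = marshal.loads(data[HEADER_SIZE:])
    module = types.ModuleType(name)
    # Run the module-level code inside the new module's namespace.
    exec(code, module.__dict__)
    return module


# Build pyc-like bytes in memory for the demo: magic number, 12 zeroed
# header bytes (flags, mtime, source size), then the marshaled code.
source = "def call_function():\n    return 'spam'\n"
code = compile(source, "<in-memory>", "exec")
pyc_bytes = importlib.util.MAGIC_NUMBER + bytes(12) + marshal.dumps(code)

module = module_from_pyc_bytes(pyc_bytes)
print(module.call_function())
```

Note that marshal.loads and exec should only be used on data you trust, since the loaded code runs with the full privileges of the current process.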


See Ned Batchelder's blog post:

At the simple level, a .pyc file is a binary file containing only three things:

  • A four-byte magic number,
  • A four-byte modification timestamp, and
  • A marshalled code object.

Note: the link refers to Python 2, but it's almost the same in Python 3; the pyc header is just 16 bytes (since Python 3.7) instead of 8.
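To make the header layout more concrete, here is a small sketch that splits Python 3.7+ pyc bytes into header fields and the code object. The field layout follows PEP 552 as I understand it (magic, flags, then either mtime and source size or a source hash); `split_pyc` and the sample values are hypothetical names and data for this example only.

```python
import importlib.util
import marshal
import struct


def split_pyc(data):
    """Split Python 3.7+ pyc bytes into header fields and a code object (hypothetical helper)."""
    # Bytes 0-3: magic number, bytes 4-7: little-endian flags word.
    magic, flags = struct.unpack("<4sI", data[:8])
    fields = {"magic": magic, "flags": flags}
    if flags & 0x1:
        # Hash-based pyc: bytes 8-15 hold a hash of the source file.
        fields["source_hash"] = data[8:16]
    else:
        # Timestamp-based pyc: source mtime and source size, 4 bytes each.
        mtime, size = struct.unpack("<II", data[8:16])
        fields["mtime"] = mtime
        fields["source_size"] = size
    return fields, marshal.loads(data[16:])


# Demo data built in memory: flags=0, mtime=1234, source size=10.
code = compile("x = 6 * 7", "<in-memory>", "exec")
pyc_bytes = (
    importlib.util.MAGIC_NUMBER
    + struct.pack("<III", 0, 1234, 10)
    + marshal.dumps(code)
)

fields, loaded = split_pyc(pyc_bytes)
namespace = {}
exec(loaded, namespace)
print(fields["mtime"], fields["source_size"], namespace["x"])
```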

ideasman42
Thomas
  • The result of `eggs` is a code object, not bytes you can read/write. Say I want to write eggs to a file or database, and load it later to execute. – ideasman42 Apr 27 '15 at 14:29
  • spam is a string, but the question is asking about using the data from a pyc file. (bytecode, which doesn't have to be parsed & recompiled every time). – ideasman42 Apr 27 '15 at 14:46
  • a string is bytes. If you want to avoid calling compile after loading the 'bytes', then you can just marshal the code object – Thomas Apr 27 '15 at 14:48
3

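Assuming the .pyc was compiled by a compatible Python version (its magic number must match the running interpreter), you can just import it. So with a file bar.pyc on the Python path (in Python 3, directly in the directory rather than inside __pycache__), the following works even if bar.py does not exist: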

import bar
bar.call_function()
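For anyone who wants to try this, here is a small self-contained sketch of such a sourceless import. It uses a temporary directory and a made-up `call_function` body purely for the demo; the only point is that the import still succeeds after bar.py has been deleted.

```python
import importlib
import pathlib
import py_compile
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp, "bar.py")
    src.write_text("def call_function():\n    return 'called'\n")

    # Write bar.pyc next to the source instead of into __pycache__,
    # since sourceless imports only look at that location.
    py_compile.compile(str(src), cfile=str(src.with_suffix(".pyc")), doraise=True)
    src.unlink()  # bar.py no longer exists, only bar.pyc remains

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        import bar
        result = bar.call_function()
    finally:
        sys.path.remove(tmp)

print(result)
```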
poke
  • 2
    Yep, I was aware of this, see question **(without having the data on-disk necessarily)** – ideasman42 Apr 27 '15 at 14:27
  • That part of your question wasn’t really clear… “the data” is quite ambiguous. You mean without having the `pyc` on the disk. You could just save the file temporarily to the current directory. – poke Apr 27 '15 at 14:33
  • 1
    Yep, I was being intentionally ambiguous, to avoid answers focusing on whichever method is used to read in the data bytes. – ideasman42 Apr 27 '15 at 14:39