
I am writing a script to search data provided by users that may contain Python code, which in turn may be calling modules. I am able to retrieve the code as a string, but I need a reliable way to determine which modules/packages are used within it so I can compare them to what is installed on my system. This is proving to be a seemingly insurmountable task, especially since I am essentially restricted to the default packages of a Python 3.x install.

I'm sure there has to be a way without being relegated to heuristics or regex. I am hoping for a method that will parse the string and return the imports so I can check them with pkgutil or the like. I tried the approach linked here, but it apparently REALLY doesn't like being handed a string instead of a file path. I was contemplating converting the string to an in-memory file stream and passing that, but I'm not sure that would be any better.

Adalast
    No, this is not possible. Arbitrary python code can do arbitrary stuff. It could, for example, ask a webserver which packages it's supposed to import. You can make guesses about what is getting imported, but without running the code you can never be 100% sure. You can however run it and listen for `ImportError`s – MegaIng Aug 15 '23 at 21:48
  • Thank you @MegaIng. I just went with RegEx to find "from * import *" and "import *,*,*" syntax substrings within the strings I am getting for the code in the customer files. Thanks for the info on the security risks; luckily everything I'm doing this for runs in airgapped Docker containers, so all good on the security front. This is just for failing the job before it starts if the packages are not present on our systems. – Adalast Aug 17 '23 at 02:06
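
A minimal sketch of the "run it and listen for `ImportError`s" idea from the comment above, assuming the untrusted code is already isolated (e.g. in the airgapped containers mentioned), since this actually executes it; `find_missing_module` is just an illustrative name:

def find_missing_module(code):
    # Only safe when the code is already sandboxed; exec() runs it for real
    try:
        exec(compile(code, "<user code>", "exec"), {})
    except ImportError as exc:
        return exc.name  # the module that could not be imported
    return None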

1 Answer


Given that there is no way to actually parse the Python code as code from within my script and look for its imports, I went with parsing the string using RegEx.

The RegEx patterns I used were (?<=from )(\w+) and (?:import )(\w+)((?:,\s*\w+)*), which respectively match the from package import module and import package, package, etc. forms. The latter handles a single import or a comma-separated list of them, and the former extracts only the package name, skipping the module name in from package.module syntax.

I used re.findall() to pull out all of the matches for each pattern, then set.add() to keep only unique package names. Altogether it looks like this:

import re
from collections.abc import Iterable

re1 = r"(?<=from )(\w+)"
re2 = r"(?:import )(\w+)((?:,\s*\w+)*)"

def insertPackagesToSet(items, s):
    for item in items:
        if isinstance(item, Iterable) and not isinstance(item, str):
            # findall() returns tuples when a pattern has several groups; the
            # comma-separated group still carries its commas, so split and strip
            for group in item:
                for name in group.split(","):
                    if name.strip():
                        s.add(name.strip())
        else:
            s.add(item.strip(", "))

packages = set()

# `string` is the user-supplied code held in memory as a single str
for line in string.splitlines():
    reExecute1 = re.findall(re1, line)
    reExecute2 = re.findall(re2, line)
    # Prefer the "from X" match so the module names after "import" in a
    # "from X import Y" line do not end up in the final set
    foundPackages = reExecute1 if reExecute1 else reExecute2 if reExecute2 else None
    if foundPackages:
        insertPackagesToSet(foundPackages, packages)
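
With the set of names collected, the actual "fail the job before it starts" check can be a simple lookup against what is importable on the system. A minimal sketch using the stdlib importlib (report_missing is just an illustrative name):

import importlib.util

def report_missing(packages):
    # find_spec() returns None when a top-level package is not installed
    return {name for name in packages if importlib.util.find_spec(name) is None}

missing = report_missing(packages)
if missing:
    raise RuntimeError("Missing packages: " + ", ".join(sorted(missing)))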
Adalast
  • You can use the stdlib `ast.parse` function to get a syntax tree and walk that instead if you prefer that method. – MegaIng Aug 17 '23 at 03:36
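
A minimal sketch of the `ast.parse` approach from that comment (imported_packages is just an illustrative name); unlike line-by-line regexes, it also handles aliased, parenthesized, and multi-line imports:

import ast

def imported_packages(source):
    # Collect top-level package names from every import statement in `source`
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "import a.b.c" -> keep only the top-level package "a"
                packages.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # "from a.b import c" -> keep "a"; level > 0 is a relative import
            # of the user's own code, so skip it
            if node.module and node.level == 0:
                packages.add(node.module.split(".")[0])
    return packages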