Pulling Dependencies of AWK scripts using Python

Question

I am working on making a dependency diagram of all scripts and their dependencies on any machine using python and Graphviz. I recently completed connecting python dependencies to their applicable scripts and generating the diagram.

Here's a super zoomed-out image of what it's looking like so far:

Green - Non-Library Script; Nothing depends on it.
Blue - Library Script
Crimson - Non-Library Script; At least 1 depends on it.

So now it's time to move on to a new language and do the same. I decided to start with AWK.

TL;DR entry point:
How can you determine what script dependencies are used in AWK?

Even a starting place would be amazing, as I have never worked with AWK before. I already have some script to parse out paths and extensions. All I need is the dependency name pulled from the script using it so that I can set up some key:value pairs.

Example data would be something like:

awk_dict = {
    'awk_script_1': ['dependency1', 'dependency2', '...'],
    'awk_script_2': ['dependency1', 'dependency2', '...'],
    # ...
    'awk_script_n': ['dependency1', 'dependency2', '...'],
}

Edit: It was requested to show how I am parsing out python scripts.

def main():
    """ Generate a diagram of the server's scripts and their relations. """
    directory_of_execs = get_executables()
    generate_graphviz_diagram(parse_script(directory_of_execs))

This will work for all executables. I later sift out only the ones that were called for in the arguments. It actually works pretty fast, so I've been holding off on moving the sifting methods here.


import fnmatch
import os


def get_executables():
    """
    Generate a list of the locations of all candidate scripts on the server.

    Files are selected by extension only; the execute permission bit is
    not checked.

    If you do not have sudo access, some executables may be missed -
    as your permission to the directory and its contents will be denied.

    Returns:
        A list of (path, filename) tuples, one for every accessible file
        with an allowed extension that is not an OS file.
    """

    # Allowed executable file extensions
    allowed_executables = ('.awk', '.c', '.csh', '.inc', '.ln', '.orig', '.pl',
                           '.pm', '.save', '.sh', '.template', '.py')
    applicable_dirs = ('/home/', '/Rusr/', '/usr/')
    exec_info = []
    # Directories listed here are mostly system files or non-important generated files.
    # These have been decided to be ignored. This list is incomplete.
    black_listed = ['redhat', 'RHEL', 'local', 'kernel?', 'lib*', 'python*', '__*__', '*system*']
    for cleared_dirs in applicable_dirs:
        for path, dirs, files in os.walk(cleared_dirs, topdown=True, followlinks=False):
            # Modify current applicable directories in-place with black_listed filters.
            dirs[:] = [
                d for d in dirs
                if not d.startswith('.')
                and not any(fnmatch.fnmatch(d, pattern) for pattern in black_listed)
            ]
            # Final sift of parsing out desired executables only.
            for executable in files:
                if executable.endswith(allowed_executables):
                    exec_info.append((path, executable))

    return exec_info
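
The function above picks files by extension. As a rough sketch of the alternative raised in the comments (checking the execute bit and looking at the file itself), something like the following could classify a file by its `#!` line. The helper names `read_shebang` and `is_executable_file` are made up for illustration, and this assumes a POSIX system where scripts start with a shebang, which is not guaranteed for awk libraries.

```python
import os


def read_shebang(path):
    """Return the interpreter name from a '#!' first line, or None."""
    try:
        with open(path, 'rb') as handle:
            first_line = handle.readline(256)
    except OSError:
        # Unreadable files (permissions, broken links) are simply skipped.
        return None
    if not first_line.startswith(b'#!'):
        return None
    parts = first_line[2:].decode('utf-8', errors='replace').split()
    if not parts:
        return None
    interpreter = os.path.basename(parts[0])
    if interpreter == 'env':
        # '#!/usr/bin/env python3' names the real interpreter after 'env';
        # options such as '-S' are skipped.
        args = [part for part in parts[1:] if not part.startswith('-')]
        interpreter = os.path.basename(args[0]) if args else None
    return interpreter


def is_executable_file(path):
    """True for regular files that the current user is allowed to execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)
```

A file named `csv2tsv` that starts with `#!/usr/bin/awk -f` would be reported as `awk` regardless of its name. Files with no shebang at all (for example awk libraries pulled in with `-f`) would still need the extension check or some other hint.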

This is where I actually pull out the import modules' information from each script and sift out any that cannot compile.


import dis
from collections import defaultdict


def parse_script(executables):
    """
    Retrieve parameters from the entered script.

    Attributes:
        script_path: A string of the path to a script.
        module_str: A string of only the script's name and extension.
    Returns:
        module_container: A dictionary mapping each script name to its
        key-values for making a diagram.
    """
    module_container = dict()
    error_scripts = []  # Scripts that cannot be disassembled due to errors within.
    called_scripts = []  # Whitelisted script extensions to add to diagram.

    ## NOTE: There will be more of these soon. Only python is supported right now.
    if ARGS.python:
        called_scripts.append('.py')

    for script_path, module_str in executables:
        # Build a dictionary with the script's info.
        script_values = dict()
        script_values['name'] = module_str[:module_str.rfind('.')].replace('"', '')
        script_values['extension'] = module_str[module_str.rfind('.'):]
        # Join the real filename so a quote stripped from 'name' cannot break the path.
        script_values['path'] = os.path.join(script_path, module_str)
        # Disassemble the script and compile.
        if script_values['extension'] in called_scripts:
            with open(script_values['path']) as file_pointer:
                # Concatenate the script.
                statements = file_pointer.read()
            try:
                # Use dis to pull information on individual scripts.
                cat_mod = dis.get_instructions(statements)
            except Exception as error:
                # If there is an error here, it is not caused by
                # this script but by the script that is being disassembled.
                # Log the bad script and the error it raises.
                error_scripts.append(f'SCRIPT :: {script_values["name"]}\nPATH '
                                     f':: {script_values["path"]}\n\t'
                                     f'ERROR INFO ::\n\t{error}')
            else:
                # Sift for information only on imports.
                imports = [module for module in cat_mod if 'IMPORT' in module.opname]
                grouped = defaultdict(list)
                for imp in imports:
                    grouped[imp.opname].append(imp.argval)
                # Check for script in module_container
                if script_values['name'] not in module_container:
                    script_values['imports'] = grouped
                    module_container[script_values['name']] = script_values

    return module_container

I am expecting I will have to make a unique function to parse out dependency information for each language. I would like to create some super awesome function that could parse all the languages, but that seems a little out of reach and pylint would probably tell me my function is too big. :(
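
One way to keep "a unique function per language" without ending up with one giant function is a small registry keyed by extension. The sketch below is only an illustration: `PARSERS`, `register`, and `parse_dependencies` are hypothetical names, and the Python parser uses the standard `ast` module rather than the `dis` approach shown above, purely to keep the example short.

```python
import ast
import os

# Maps a file extension to the function that extracts its dependencies.
PARSERS = {}


def register(extension):
    """Decorator that files a parser function under the given extension."""
    def decorator(func):
        PARSERS[extension] = func
        return func
    return decorator


@register('.py')
def python_imports(source):
    """Collect imported module names from Python source text."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # Relative imports like 'from . import x' have no module name.
            names.append(node.module)
    return names


def parse_dependencies(filename, source):
    """Dispatch to the parser for this extension; unknown types give []."""
    extension = os.path.splitext(filename)[1]
    parser = PARSERS.get(extension)
    return parser(source) if parser else []
```

Adding AWK later would then mean writing one more function decorated with `@register('.awk')`, and each parser stays small enough to keep pylint quiet.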

awk scripts usually don't have dependencies, they're just standalone scripts that read text in and print text out. They can, however, pick up library functions through the AWKPATH variable, or with `-f` or `-i` on the command line etc, or the `@load` directive in the script (if you're using GNU awk), or they can call external programs using `system()` or a print piped to it or calling it as a string piped to getline. So, if you're asking how to write a tool to parse an awk script to find out how it relates to all of that, I think you'd have to write an awk interpreter which would be daunting. — Ed Morton, Aug 24 '21 at 20:36
I'd have expected something very similar to be true for `python`, though, so if you share in your question how you're handling that maybe it'll give us some ideas on what you're looking for and maybe it's not everything I'm imagining (e.g. maybe you're asking about build-time dependencies rather than run-time dependencies?). — Ed Morton, Aug 24 '21 at 20:38
Indeed, what @EdMorton said ... that, to me, looks like a job for a parser. — tink, Aug 25 '21 at 02:44
Thank you for your suggestions, I have added the code I use to parse out dependency information from python scripts. @EdMorton — Bushbaker27, Aug 25 '21 at 14:28
Regarding `allowed_executables ... '.sh'` - shell scripts should not end in `.sh` if that's what that's intended to find. A shell script is just a command like any other command and so should just be named based on what it does, not how it's implemented. If you create a script to convert CSVs to TSVs using shell it'd be named `csv2tsv` or similar, not `csv2tsv.sh`. If you later decide to reimplement it in C the executable would remain named `csv2tsv`, you wouldn't have to find every other script calling `csv2tsv.sh` and rename it. — Ed Morton, Aug 25 '21 at 14:32
".c" files aren't executable either, and technically neither are ".awk" files. I also wouldn't trust that a file with any given extension is really an executable since nothing in Unix uses extensions that way (e.g. maybe `foo.sh` is a shopping list). You can't find executables by extensions in Unix, you'd have to do something like `find / -type f -executable` and then probably additionally run `file` on them. — Ed Morton, Aug 25 '21 at 14:37
Thanks! I'll be sure to go back and change that code. I am a little new to UNIX (just over a year of experience with it) so any tips are greatly appreciated. I have to find a way to pull executables on any distro. Dev server is RHEL and production is FreeBSD. I initially tried the `find / -type f -executable` but it did not translate well to FreeBSD. — Bushbaker27, Aug 25 '21 at 14:59
@Bushbaker27 that just means you didn't use GNU `find`, you can install it on a BSD system but see also https://stackoverflow.com/questions/4458120/search-for-executable-files-using-find-command. The problem with that though is that it seems like you actually are NOT trying to just find executables, e.g. you want to find awk scripts but those are interpreted, not executable, so you may be looking at using `find . -type f -exec file {} \;` piped to an awk script or something to select the files you're really interested in. — Ed Morton, Aug 25 '21 at 15:13
Having said that - `file` can't always tell an awk script from any other ascii text file (I just tried it on a couple). You should probably ask a question... depending on what you want to use this for, maybe it doesn't have to be 100% robust. — Ed Morton, Aug 25 '21 at 15:19
It's also not clear how you'd deal with a shell script that calls python and/or perl and/or awk and/or a C program or anything else even in the unlikely and undesirable-in-general case that the shell script name DID end in `.sh` - you'd have to parse the shell script to find which tools it calls then parse the command-line options and scripts (or binary if it's a compiled C program) to find out what each of those tools/scripts/images depend on. A nightmare! — Ed Morton, Aug 25 '21 at 16:32
Haha so far this project has been a bit of a nightmare. I think I'll take this to my boss and ask if he really wants something this robust and if it is actually feasible. Thank you for all your help! I have a good idea where to start now. — Bushbaker27, Aug 25 '21 at 17:40
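
If the diagram doesn't have to be 100% robust, the mechanisms listed in the comments above could be approximated with a plain text scan instead of a real awk parser. The sketch below is a heuristic under stated assumptions, not a definitive implementation: the function name `scan_awk_source` and the regular expressions are invented for illustration, it only sees literal strings (a command built from variables is invisible to it), it does not skip awk comments, and it cannot see libraries passed with `-f`/`-i` on a command line elsewhere. It looks for gawk's `@include` and `@load` directives, `system("...")`, `"cmd" | getline`, and `print ... | "cmd"`.

```python
import re

# Heuristic patterns for literal strings only; this is not an awk parser.
DIRECTIVE_RE = re.compile(r'^\s*@(include|load)\s+"([^"\n]+)"', re.MULTILINE)
SYSTEM_RE = re.compile(r'\bsystem\s*\(\s*"([^"\n]+)"')
CMD_GETLINE_RE = re.compile(r'"([^"\n]+)"\s*\|&?\s*getline\b')
PRINT_PIPE_RE = re.compile(r'\bprintf?\b[^|\n]*\|&?\s*"([^"\n]+)"')


def scan_awk_source(source):
    """Return likely dependencies found in awk source text."""
    deps = {'include': [], 'load': [], 'command': []}
    for kind, name in DIRECTIVE_RE.findall(source):
        deps[kind].append(name)
    for pattern in (SYSTEM_RE, CMD_GETLINE_RE, PRINT_PIPE_RE):
        for command_line in pattern.findall(source):
            # Keep only the program name, not its arguments.
            parts = command_line.split()
            if parts and parts[0] not in deps['command']:
                deps['command'].append(parts[0])
    return deps


SAMPLE = '''@include "lib_math.awk"
@load "filefuncs"
BEGIN {
    system("date +%s")
    "hostname" | getline host
    print "hello" | "sort -u"
}
'''

if __name__ == '__main__':
    print(scan_awk_source(SAMPLE))
```

To get the `awk_dict` shape from the question, the three lists could be concatenated per script, e.g. `awk_dict[name] = deps['include'] + deps['load'] + deps['command']`.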
