
I am working on a project where I have to determine whether a particular call or import is:

  1. From the standard library of the language (Python) I'm using. (I am already considering using sys.stdlib_module_names for this.)
  2. From a third-party library, or
  3. An API call made to some service from within the repository.

Is there an efficient way or tool that could help me quickly differentiate between these types of calls or imports? I'm primarily using Python, but methods for other languages are welcome as well.

I am working on a project wherein I aim to compile a dataset of library and function calls made within a given repository from GitHub.

First, I download a given Python repository from GitHub.

Then my main objectives are:

  • To extract all function calls made within the target repository.
  • To gather details of these function calls, including the arguments they use.
  • For this purpose, I am employing the Python AST (Abstract Syntax Tree) parser to detect and catalogue function calls and their respective arguments.
  • My entire analysis pipeline is based within a Python script leveraging the AST module.
  • Now I have to determine which of these function calls originate from within the repository itself.
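The extraction step above can be sketched with the ast module; the sample source and the call_name helper below are illustrative, not part of the actual pipeline:

```python
import ast

# A small stand-in for one file of the target repository.
source = """
import numpy as np
from file_b import abc

def foo():
    x = np.linspace(0, 1, 10)
    c = abc()
"""

def call_name(node: ast.Call) -> str:
    # Reconstruct a dotted name like "np.linspace" from the call's func node.
    parts = []
    func = node.func
    while isinstance(func, ast.Attribute):
        parts.append(func.attr)
        func = func.value
    if isinstance(func, ast.Name):
        parts.append(func.id)
    return ".".join(reversed(parts))

tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, ast.Call):
        # Print the call's dotted name and its number of positional arguments.
        print(call_name(node), len(node.args))
```

Keyword arguments are available on node.keywords if the dataset should record those as well.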

For example, if there is a call

file_b.py

def abc():
  ....

file_a.py

import math
import numpy as np
from file_b import abc
....
def foo():
   ..
   x = np.linspace(-math.pi, math.pi, 2000)
   y = np.sin(x)
   ...
   ..
   c = abc()

I want to capture only abc (as it is defined in the repository) and not the calls to the numpy module.
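One hedged sketch of that filtering: build the import table from the AST first, then classify each imported name by whether its module exists as a file in the repository. The repo_modules set is hard-coded here for illustration; in practice it would be derived from the repository's file listing:

```python
import ast

# Stand-in for file_a.py from the example above.
source = """
import numpy as np
from file_b import abc

def foo():
    x = np.linspace(0, 1, 10)
    c = abc()
"""

tree = ast.parse(source)

# Map each local name introduced by an import to the module it came from.
imported = {}
for node in ast.walk(tree):
    if isinstance(node, ast.Import):
        for alias in node.names:
            imported[alias.asname or alias.name] = alias.name
    elif isinstance(node, ast.ImportFrom):
        for alias in node.names:
            imported[alias.asname or alias.name] = node.module or ""

# Hypothetical: the repository contains file_b.py, so "file_b" is local.
repo_modules = {"file_b"}

for name, module in sorted(imported.items()):
    origin = "repo" if module.split(".")[0] in repo_modules else "external"
    print(f"{name} -> {module} ({origin})")
```

A call like abc() would then be kept because its name resolves to a repo module, while np.linspace resolves to numpy and is dropped. Names defined in the same file (plain def statements) can be collected similarly from ast.FunctionDef nodes.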

    Why do you need to make this determination? What problem will you solve by knowing? And why is this tagged Java? – Karl Knechtel Aug 13 '23 at 05:07
  • I would have said the best way to do this was to build up a cross-reference -- a concordance that includes all of the APIs in your library. Then, you can do simple text searches. Note, however, that an app that does `from xxx import *` can make it impossible to know this kind of information without some sophisticated analysis. – Tim Roberts Aug 18 '23 at 05:29
  • how to populate that cross-reference list though? I am not aware of any library that can do it. – Exploring Aug 18 '23 at 05:56
    Could you add some details of the supposed activity, a use case? For example, do you intend to work with code as text, or inside running code with live objects? How do you collect calls? Do you differentiate functions and methods? etc. – Vitalizzare Aug 18 '23 at 16:34
  • We need context for why you want this and what it solves. Is it for static analysis? License checking? Import ordering? Something else? Without that information, none of the answers provided are likely to be actually useful, because the scope of the problem is too vague. Although a generic solution may exist, it becomes immediately more obvious what that generic solution could be when considering an actual use-case. – flakes Aug 21 '23 at 22:56
  • Have you looked at the likes of pylance or some other Python language server; the type of software that makes the refactoring / static analysis smarts for IDEs such as VSCode? – Cameron Kerr Aug 24 '23 at 11:10

2 Answers


You can use the inspect module, since it seems to have been written with your purpose in mind. A simple way to differentiate is to look at the on-disk location of the module that defines a function, e.g.:

import os
import inspect

# an installed ("site-packages") library
import numpy as np
# a "local" module from the project
import cfg

# we can assign a module to a variable if needed
foo = np
print(1, os.path.dirname(os.path.abspath(foo.__file__)))
foo = cfg
print(2, os.path.dirname(os.path.abspath(foo.__file__)))
print()

# we can get the defining module from any function
unknown_function = np.sort
the_module = inspect.getmodule(unknown_function)
print(the_module)
print(3, os.path.dirname(os.path.abspath(the_module.__file__)))

The result is:

1 /home/datalab/workspace/conda/lib/python3.8/site-packages/numpy
2 /home/datalab/workspace/utils

<module 'numpy' from '/home/datalab/workspace/conda/lib/python3.8/site-packages/numpy/__init__.py'>
3 /home/datalab/workspace/conda/lib/python3.8/site-packages/numpy

In your case, you have 3 categories:

  • The first should originate from the conda/pip installation (you can check the location of your environment using sys.executable).
  • The second, from a third-party library, should resolve to a well-known path prefix.
  • The third would live within the project repository, whose root may be known in advance or obtained by running subprocess.check_output(['git', 'rev-parse', '--show-toplevel']) from within the repository.

The inspect module can do a lot more than give you the location on disk in more complex situations; the Python Module of the Week article on inspect covers more uses and further examples.

A practical note: importing a module means running foreign code, so make sure you trust the code, or run it in some sandboxed environment. How to do the latter is a question of its own.

A theoretical note: in extreme cases this problem is, I think, undecidable. A formal proof might involve one module that halts on import and another that does not; any analysis that could discriminate between the two would therefore solve Turing's halting problem. For our case using inspect, this means that there exist modules whose import can potentially take forever. Practically this should not be a problem, because any reasonable module can be imported in reasonable time.

  • I wonder why my previous comment got deleted without any explanation. I raised what I believe is a legitimate issue about this answer, that the OP specifically asks about a static analysis of a repository (in the objectives section), while the solution suggested by this answer requires actually running the code after modifications, which may require a proper setup of environments, configurations and external dependencies, may produce side effects, and still would not cover all the calls if some calls are made conditionally. – blhsing Aug 25 '23 at 04:14
  • I do not think it was me who deleted the comment though (or that I can do that, unless this is recent?). I wrote a comment about how I really appreciated the fact that you wrote about the reasons for your downvote although I do not agree, then mentioned that imho the question was about how to identify a third-party library call even within static analysis. Then I read your comment more carefully and realised you had concerns about running the code instead of just analysing it, so I deleted my comment and planned to write a new one covering how security can be addressed and why running the code is needed – ntg Aug 25 '23 at 08:15
    The argument for why what the question asks can only be done by "running" and not just static analysis has to do with mutability, as opposed to the immutability offered by languages such as OCaml, etc., and it needs parts of my computer science knowledge I have not touched for a while. Essentially, suppose a variable points to a function. Python code can change the value of that function in a way that cannot be predicted by static analysis alone (because of the Turing completeness of Python and the fact that Python functions are not immutable -- I think, I am rusty with proper proofs there). Therefore, same as deciding on halting, – ntg Aug 25 '23 at 08:28
    you cannot decide the value of a function by static analysis alone (e.g. suppose the function is the halting function, then follow Turing's halting undecidability). Of course that does not mean you should just run any code on your computer... That is why I deleted my response to your comments and was planning to write a better one covering how you should run untrusted code in a sandbox, but that is not part of the question, rather a new one... – ntg Aug 25 '23 at 08:32

Pylint (https://www.pylint.org/) provides the static analysis tooling you need, along with numerous editor and IDE integrations.

Pylint output can be redirected to a text file, and you can customize the output format and then parse it with your own script. That script could isolate and flag the lines of output that relate to your 3 categories, or to any other categories and specifics you wish to pull out of the log.

The configuration options include standard checkers and extensions, which you can also write yourself:

  • You can tell it to ignore specific modules (--ignored-modules)
  • You can add paths to the list of source roots (--source-roots) used to determine package namespace for modules located under the source roots
  • You can generate a graph of dependencies for a given file (--int-import-graph)
  • You can force import order to recognize a module as part of a third party library (--known-third-party)
  • And much more! See for yourself.

Transform modules are a type of Pylint plugin that can be tailored toward a specific module, library, or framework. Additionally, custom checkers can analyse a module as a raw file stream, as a series of tokens, or as an AST representation of the module. See: https://pylint.pycqa.org/en/latest/development_guide/how_tos/custom_checkers.html#write-a-checker and pylint plugin to warn of specific function use?

Note that, when writing your scripts, you may make use of inspect or the dir() function to inspect modules and help identify where they come from. See: https://www.javatpoint.com/list-all-functions-from-a-python-module

For example:

import module
dir(module)

Or:

from inspect import getmembers, isfunction
import stats  # the module you want to inspect
print([f for f in getmembers(stats) if isfunction(f[1])])

You can also use regexes and string parsing to examine pylint's output logs and handle them accordingly; I mentioned this previously, but it is worth emphasizing.
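For instance, a line of pylint's text output can be pulled apart with a regex. The sample line and pattern below assume pylint's default message format (path:line:column: code: message (symbol)) and are illustrative:

```python
import re

# One line of pylint's default text output (illustrative sample).
line = "file_a.py:12:0: W0611: Unused import os (unused-import)"

PATTERN = re.compile(
    r"^(?P<path>[^:]+):(?P<line>\d+):(?P<col>\d+): "
    r"(?P<code>[A-Z]\d{4}): (?P<message>.*) \((?P<symbol>[\w-]+)\)$"
)

match = PATTERN.match(line)
if match:
    # Keep whichever fields your categorization script needs.
    print(match.group("path"), match.group("code"), match.group("symbol"))
```

The named groups make it easy to bucket messages by file, message code, or symbol when post-processing a full report.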

ASTs (abstract syntax trees) help Python applications process trees of the Python abstract syntax grammar. A Python AST can be traversed, and each node can be traced back to its location in the source.

You can also learn more about using AST in this medium article: https://medium.com/@wshanshan/intro-to-python-ast-module-bbd22cd505f7

and also in this Pybit.es article: https://pybit.es/articles/ast-intro/