Is it possible to open and read cythonized .so files with python?

The use case is a test that scans all Python files in a directory and checks whether certain object attributes are used (ultimately, to identify and remove unused attributes). The test runs fine in the local environment, but it breaks in our CI, which cythonizes all files, because .so files can't be parsed.

Currently I am scanning the files for the object attribute occurrences like this:

import os
import re

path = '/path/to/dir'
attribute_regex = r'object\.(\w+)'

used_attributes = set()
for root, _, files in os.walk(path):
    for file in files:
        with open(os.path.join(root, file), 'r') as f:
            # re.findall needs a string, so read the file contents first
            used_attributes.update(re.findall(attribute_regex, f.read()))

Maybe I am looking at this issue from the wrong angle. Are there other, more sophisticated ways to check whether attributes of an object are used across multiple Python files?
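
One possible direction for plain .py sources (it would not help with compiled .so files) is to walk the syntax tree with the standard `ast` module instead of using a regex, so that matches inside comments and string literals are skipped. This is only a minimal sketch: the helper name `collect_attribute_names` and the `sample` text are hypothetical, and it assumes the object is always referenced through one fixed variable name, as the regex above does.

```python
import ast


def collect_attribute_names(source, target_name="object"):
    """Return attribute names accessed directly on the variable `target_name`."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        # Match `target_name.something`, where the left side is a plain name.
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id == target_name
        ):
            names.add(node.attr)
    return names


# Hypothetical sample source; it is only parsed here, never executed.
sample = (
    "object.width = 3\n"
    "print(object.height)\n"
    "# object.ignored appears only in a comment\n"
    "text = 'object.in_string'\n"
)
found = collect_attribute_names(sample)
```

Here `found` contains only `width` and `height`, whereas the regex would also pick up `ignored` and `in_string`.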

gustavz
  • Does this answer your question? [Is there a way to read the contents of .so file without loading it?](https://stackoverflow.com/questions/19396048/is-there-a-way-to-read-the-contents-of-so-file-without-loading-it) – mkrieger1 Sep 06 '22 at 13:47
  • Is it safe/okay to just import the files? As written, you'll identify all sorts of attributes in plain Python files that are purely local variable names (no one can actually use them from outside the module). Simply importing the module and running `dir` or `vars` on it would give more useful information. By the time something is Cythonized, the non-public names don't *exist* anymore (the original source code is not required anymore, so the `.so` is just what you get from compiling the generated C++, which does not need to use/expose any useful names). – ShadowRanger Sep 06 '22 at 13:50
  • @mkrieger1 yes the question addresses the same issue with reading .so files, but does not provide answers that solve it for my case, does it? – gustavz Sep 06 '22 at 13:50
  • @mkrieger1: Not a good duplicate; Python extension modules are only required to expose a single name for the dynamic loader (the module entrypoint, which has a predictable name in any event), so you'd get nothing useful (from the standpoint of determining which names are exposed at the Python layer) from any of the tools listed there. Knowing that the `spam` module exposes `PyInit_spam` is not super-helpful. – ShadowRanger Sep 06 '22 at 13:55
  • @ShadowRanger am I able to see if methods used in imported modules use certain object attributes internally? – gustavz Sep 06 '22 at 13:56
  • @gustavz: Generally not; once dropped to the C layer, many attribute accesses aren't actually done with Python-like attribute access. C level Python objects are represented as C `struct`s, and attribute access on them (from the Python layer) is handled through special accessor methods (equivalent to C level `PyObject_GetAttr`), but when the C layer knows the `struct` itself, it typically bypasses the attribute lookup APIs and just accesses the `struct` member directly. Once compiled, that `struct` member lookup is just looking up through a pointer at a fixed offset, no names involved. – ShadowRanger Sep 06 '22 at 14:01
  • @gustavz: Can you explain why you need to do this? This feels like [an XY problem](https://meta.stackexchange.com/q/66377/322040). I'd expect a static analyzer for the original source code would be more useful (though I'll admit I don't know off-hand of static analyzers that work on Cython syntax source code). – ShadowRanger Sep 06 '22 at 14:02
  • Oh, minor side-note: Your regex is slightly off. Aside from `object` being a fixed name, you shouldn't allow `\w` for the whole of the attribute; the first character can't be a digit, so you want to match `\.([^\W\d]\w*)'` instead of `\.(\w+)'` (`[^\W\d]` is the Unicode friendly way to say you accept underscore and alphabetic characters, but not numeric characters, for the first character; after the first character, `\w` is correct). – ShadowRanger Sep 06 '22 at 14:10
  • This should just be a simple test that evaluates if all attributes of a huge object are used. We are highly interested in removing unused attributes. Is this explanation sufficient? – gustavz Sep 06 '22 at 14:12
  • I doubt you'd have too much luck scanning .so files. If I were doing this I'd overload `__getattribute__` and add some logging there based on runtime usage. – DavidW Sep 06 '22 at 17:33
  • Alternatively, why not use your existing regex on the .pyx files that get compiled to the .so? – DavidW Sep 06 '22 at 17:35
  • there are no .pyx files in the CI @DavidW, only .so – gustavz Sep 08 '22 at 16:06
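
The `__getattribute__` suggestion from the comments could look roughly like the sketch below. It is not taken from the thread: the class names `AttributeUsageTracker` and `Config` and the shared `accessed` set are hypothetical, and it only records lookups that go through Python-level attribute access. As noted in the comments, compiled code that reads a C `struct` member directly would bypass this hook.

```python
class AttributeUsageTracker:
    """Mixin that records every attribute name looked up on an instance."""

    accessed = set()  # shared across all instances (hypothetical bookkeeping)

    def __getattribute__(self, name):
        # Record the lookup, then defer to the normal lookup mechanism.
        AttributeUsageTracker.accessed.add(name)
        return super().__getattribute__(name)


class Config(AttributeUsageTracker):
    """Stand-in for the large object whose attributes are being audited."""

    def __init__(self):
        # Assignments go through __setattr__, so they are not recorded.
        self.width = 3
        self.height = 4
        self.unused = 5


cfg = Config()
area = cfg.width * cfg.height  # reads width and height only

# Attributes that exist on the instance but were never read at runtime.
never_read = set(vars(cfg)) - AttributeUsageTracker.accessed
```

In this toy run `never_read` ends up containing only `unused`. In practice the result would depend on how much of the code base the test run actually exercises, so an attribute missing from `accessed` is a candidate for removal rather than proof that it is unused.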
