
I have issues understanding some subtleties of the Python import system. I have condensed my doubts around a minimal example and a number of concrete and related questions detailed below.

I have defined a package in a folder called modules, which contains an __init__.py and two regular modules: one with general functionality for the package and the other with the definitions for the end user. The content is as simple as:

__init__.py

from .base import *
from .implementation import *

base.py

class FactoryClass():
    registry = {}

    @classmethod
    def add_to_registry(cls, newclass):
        cls.registry[newclass.__name__] = newclass

    @classmethod
    def getobject(cls, classname, *args, **kwargs):
        return cls.registry[classname](*args, **kwargs)


class BaseClass():
    def hello(self):
        print(f"Hello from instance of class {type(self).__name__}")

implementation.py

from .base import BaseClass, FactoryClass

class First(BaseClass):
    pass

class Second(BaseClass):
    pass

FactoryClass.add_to_registry(First)
FactoryClass.add_to_registry(Second)

The user of the package will use it as:

import modules

a = modules.FactoryClass.getobject("First")
b = modules.FactoryClass.getobject("Second")
a.hello()
b.hello()

This works. The problem arises because I'm still developing this, and my workflow consists of adding functionality in implementation.py and then continually testing it by reloading the module. But I cannot understand or predict which module I have to reload to get the functions updated. I'm making changes that have no effect and it drives me crazy (until yesterday I was working on a large .py file with all the code lumped together, so I had none of these problems).

Here are some tests I have done, and I'd like to understand what's happening and why.

First, I start by commenting out all mentions of the Second class in implementation.py (to pretend it has not been developed yet):

from importlib import reload
import modules

modules.base.FactoryClass is modules.FactoryClass  # returns True
modules.FactoryClass.registry    # just First class is in registry 

a = modules.FactoryClass.getobject("First")
b = modules.FactoryClass.getobject("Second")   # raises KeyError as expected

This code and its output are pretty clear. The only thing that puzzles me is why there is a modules.base module at all (I did not import it!). Further, it is redundant, as its classes point to the same objects. Why does importing modules also import modules.base and modules.implementation as separate but essentially identical objects?

Now things become interesting as I uncomment the code, i.e. I finish developing Second, and I'd like to test it without having to restart the Python session. I have tried 3 different reloads:

reload (modules)

This does absolutely nothing. I'd expect some sort of recursion, but as I have found in many other threads, this is the expected behavior.

Now I try to manually reload one of those "unexpected" modules:

reload (modules.implementation)
modules.base.FactoryClass is modules.FactoryClass     # True
modules.FactoryClass.registry                         # First and Second

a = modules.FactoryClass.getobject("First")      
b = modules.FactoryClass.getobject("Second")          # Works as expected

This seems to be the right way to go. It updates the module contents as expected and the new functionality is usable. What puzzles me is why modules.FactoryClass has been updated (its registry) despite the fact that I did not reload the modules.base module. I'd expect this class to stay "outdated".

Finally, and starting from the just freshly uncommented version, I have tried

reload (modules.base)
modules.base.FactoryClass is modules.FactoryClass    # False
modules.FactoryClass.registry  # just First class is in registry
modules.base.FactoryClass.registry  # empty 

a = modules.base.FactoryClass.getobject("First")
b = modules.base.FactoryClass.getobject("Second")   # raises KeyError

This is very odd. modules.FactoryClass is outdated (Second is unknown). The registry of modules.base.FactoryClass is empty. Why are modules.FactoryClass and modules.base.FactoryClass now different objects?

Could someone explain why the three different ways of reloading a package behave so differently?

Pythonist
    `Why importing modules also imports modules.base and modules.implementation`. That's simply how it works. When you import a package you get access to the modules in that package. The same reason you can do `import os` and get access to `os.path`. People would have to be doing a whole lot of extra importing without that. – Kemp Jun 21 '21 at 14:17
  • But if you (as developer of the package) want to give the users access to the modules you can do so with the __init__.py file. Giving access to all modules by default results in less explicitness. And more importantly it’s a source of confusion, as there are module components that are duplicated, sometimes being identical sometimes not (as my examples illustrate). Don’t get me wrong, I’m not complaining or suggesting this is a bug. I just try to understand the rationale of all this, and I still don’t see it. – Pythonist Jun 21 '21 at 22:05

1 Answer


You are confused about how the Python import system works, so I strongly recommend that you read the corresponding documentation: the import system and importlib.reload.

A foreword: hot-reloading code in Python is tricky. I recommend not doing it unless it is required. You have seen it yourself: the resulting bugs are hard to understand.

Now to your questions:

Why does importing modules also import modules.base and modules.implementation as separate but essentially identical objects?

As @Kemp answered in a comment (which I upvoted), imports are transitive. When you import a, Python locates, compiles and executes the corresponding file. If that module itself imports b, Python does the same for b's file, and so on. You don't see it, but by the time your program starts, a lot of modules have already been imported.

Given this file:

print("nothing else")

When I set my debugger to pause before executing the print line and look into sys.modules, I already find 338 different modules imported: builtins (where print comes from), sys, itertools, enum, json, ...

Understand that "no visible import statement" does not mean "nothing has been imported".
When you execute import a, Python starts by checking its sys.modules cache to determine whether the library has already been read from disk, parsed, compiled and executed into a module object. If it has not yet been imported during this program, Python takes the time to do all that. Because this is slow, Python caches the result.
The result is a module object, which gets bound to a name in the current namespace so that you can access it.
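
The cache can be observed directly. This is only a minimal sketch using the standard json module; the marker attribute is made up for the demonstration and is not part of json. A second import returns the object already stored in sys.modules instead of executing the file again, so the tag set on the first import is still there:

```python
import sys

import json                      # imported (or fetched from the cache if already loaded)
json.marker = "set by the demo"  # hypothetical attribute, only used to tag this module object

import json as json_again        # cache hit: json's source file is not executed again

print(json_again is json)                 # True
print(json_again is sys.modules["json"])  # True
print(json_again.marker)                  # set by the demo
```
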
We can summarize it like this:

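import importlib.util
import sys
from types import ModuleType

def import_library(name: str) -> ModuleType:
    """Simplified sketch of what `import name` does (not the real implementation)."""
    if name not in sys.modules:
        # cache miss: locate the file, create the module object, execute its code
        spec = importlib.util.find_spec(name)
        module = importlib.util.module_from_spec(spec)
        sys.modules[name] = module  # cached before execution, like the real import
        spec.loader.exec_module(module)
    # in any case, at this point, the module is available
    return sys.modules[name]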

You are thus confusing module objects with variables.
In any module you can declare variables with whatever name you like (as long as Python's grammar allows it), and some of them will reference module objects.

Here is an example:

# file: main.py

import lib  # create a variable named `lib` referencing the `lib` library
import lib as horse  # create a variable named `horse` referencing the `lib` library
print(lib.a.number)  # 14
print(horse.a.number)  # 14
print(lib is horse)  # True

print(lib.a.sublib.__name__)  # lib.sublib
import lib.sublib
from lib import sublib
import lib.sublib as lib_sublib
print((lib.sublib is sublib, sublib is lib_sublib, lib.a.zebra is sublib))  # (True, True, True)

import sys
print(sys.modules["lib"] is lib)  # True
print(sys.modules["lib.sublib"] is sublib)  # True

print(lib.sublib.package_color)  # blue
print(lib.sublib.color)  # AttributeError: module 'lib.sublib' has no attribute 'color'

# file: lib/__init__.py
from . import a

# file: lib/a.py
from . import sublib
from . import sublib as zebra
number = 14

# file: lib/sublib/__init__.py
from .b import color as package_color

# file: lib/sublib/b.py
color = "blue"

Python offers a lot of flexibility in how to import things, what to expose and how to access it. But I admit it is confusing.
Also take a look at the role of __all__ in __init__.py. Given that, you should now understand the answer to your question about submodule naming and visibility.
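
Applied to your package, this can be checked with a minimal sketch. The package name pkg_demo and the temporary directory are made up for the demonstration; base.py is a shortened version of yours. It suggests that modules.base is not a second copy of anything: it is the module object that `from .base import *` had to import anyway, registered in sys.modules and bound as an attribute of the package.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package named `pkg_demo` (hypothetical name) in a temp directory.
root = Path(tempfile.mkdtemp())
pkg = root / "pkg_demo"
pkg.mkdir()
(pkg / "__init__.py").write_text("from .base import *\n")
(pkg / "base.py").write_text("class FactoryClass:\n    registry = {}\n")

sys.path.insert(0, str(root))
importlib.invalidate_caches()

import pkg_demo

# The submodule was imported as a side effect of `from .base import *` ...
print("pkg_demo.base" in sys.modules)                      # True
# ... and bound as an attribute of the package object.
print(pkg_demo.base is sys.modules["pkg_demo.base"])       # True
# Both names refer to the very same class object, not to two copies.
print(pkg_demo.FactoryClass is pkg_demo.base.FactoryClass)  # True
```
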



reload (modules) This does absolutely nothing. I'd expect some sort of recursion, but as I have found in many other threads, this is the expected behavior.

Given what I explained, can you now understand what it does? And why what it does is not what you want it to do?

Because what you want is to get modules.implementation hot-reloaded, but you ask for modules to be reloaded.

>>> from importlib import reload
>>> import lib
>>> lib.sublib.package_color
'blue'
>>> # I edit the "b.py" file so that the color is "red"
>>> old_lib = lib
>>> new_lib = reload(lib)
>>> lib is new_lib, lib is old_lib
(True, True)
>>> lib.sublib.package_color
'blue'
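>>> _ = reload(lib.sublib.b)  # reload the module that was actually edited
>>> lib.sublib.b.color
'red'
>>> import sys
>>> sys.modules["lib.sublib.b"].color
'red'
>>> lib.sublib.package_color
'blue'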

First, the top-level reload had no effect, because all that lib/__init__.py does is run an import that hits the sys.modules cache, so nothing is re-executed further down.
You have to reload the module that actually changed for its new content to take effect. Even then there is no magic: reload re-executes the file and rebinds the module-level names inside the same module object, but it cannot update references that other modules already hold to the previous content. That is why package_color is still "blue" even after b has been reloaded: it was bound to the first version's color value and is not rebound when b is reloaded. This is dangerous: different versions of similar things may be lying around at the same time.
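
The whole sequence can be reproduced in one script. This is a minimal sketch under stated assumptions: the package name stale_demo and the temporary directory are made up, the file edit is simulated with write_text, and sys.dont_write_bytecode is set so that a rapid edit cannot be masked by a cached .pyc file.

```python
import importlib
import sys
import tempfile
from pathlib import Path

sys.dont_write_bytecode = True  # no .pyc files, so reload always recompiles the source

# Throwaway package `stale_demo` (hypothetical name) mirroring lib/sublib above.
root = Path(tempfile.mkdtemp())
pkg = root / "stale_demo"
pkg.mkdir()
(pkg / "__init__.py").write_text("from .b import color as package_color\n")
(pkg / "b.py").write_text('color = "blue"\n')

sys.path.insert(0, str(root))
importlib.invalidate_caches()
import stale_demo

# Simulate editing b.py, then reload only the package: the import of b hits the cache.
(pkg / "b.py").write_text('color = "red"\n')
importlib.reload(stale_demo)
after_package_reload = (stale_demo.package_color, stale_demo.b.color)

# Reload the edited module: b.color changes, the name copied into the package does not.
importlib.reload(stale_demo.b)
after_module_reload = (stale_demo.package_color, stale_demo.b.color)

# Re-executing the package's `from .b import ...` finally rebinds package_color.
importlib.reload(stale_demo)
after_both = (stale_demo.package_color, stale_demo.b.color)

print(after_package_reload)  # ('blue', 'blue')
print(after_module_reload)   # ('blue', 'red')
print(after_both)            # ('red', 'red')
```

Only the last state is consistent, and it required reloading the edited module first and every module that copied names from it afterwards.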



why modules.FactoryClass has been updated

You are reloading modules.implementation in this case. This re-executes the whole file to repopulate the module object. I have annotated the visible effect of each line:

from .base import BaseClass, FactoryClass  # relative library "base" already in the `sys.modules` cache

class First(BaseClass):  # was already defined, redefined
    pass

class Second(BaseClass):  # was not defined yet, created
    pass

FactoryClass.add_to_registry(First)  # overwriting the registry entry for "First" with the redefinition of class `First`
FactoryClass.add_to_registry(Second)  # registering for "Second" the definition of class `Second`

You can see it another way:

>>> import modules
>>> from importlib import reload
>>> First_before = modules.implementation.First
>>> reload(modules.implementation)
<module 'modules.implementation' from 'C:\\PycharmProjects\\stack_overflow\\68069397\\modules\\implementation.py'>
>>> First_after = modules.implementation.First
>>> First_before is First_after
False

When you reload a module, Python re-executes all of its code, possibly with a different outcome than the previous time(s). Here, each execution calls FactoryClass.add_to_registry on the same, unreloaded FactoryClass object, so its registry gets updated with the (re)definitions.
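
A minimal sketch of that situation, with assumptions stated: the package name registry_demo and the temporary directory are made up, and base.py is simplified to write into the registry dict directly instead of going through add_to_registry. It checks that the class and its registry are the same objects before and after the reload, and that only their content changed.

```python
import importlib
import sys
import tempfile
from pathlib import Path

sys.dont_write_bytecode = True  # no .pyc files, so reload always recompiles the source

# Throwaway package `registry_demo` (hypothetical name), simplified from the question.
root = Path(tempfile.mkdtemp())
pkg = root / "registry_demo"
pkg.mkdir()
(pkg / "__init__.py").write_text("from .base import *\nfrom .implementation import *\n")
(pkg / "base.py").write_text(
    "class FactoryClass:\n    registry = {}\nclass BaseClass:\n    pass\n"
)
impl_source = (
    "from .base import BaseClass, FactoryClass\n"
    "class First(BaseClass):\n    pass\n"
    "FactoryClass.registry['First'] = First\n"
)
(pkg / "implementation.py").write_text(impl_source)
sys.path.insert(0, str(root))
importlib.invalidate_caches()
import registry_demo

factory_before = registry_demo.FactoryClass
registry_before = factory_before.registry
first_before = registry_before["First"]

# Simulate finishing `Second`, then reload implementation only.
(pkg / "implementation.py").write_text(
    impl_source + "class Second(BaseClass):\n    pass\nFactoryClass.registry['Second'] = Second\n"
)
importlib.reload(registry_demo.implementation)

print(registry_demo.FactoryClass is factory_before)            # True: base was not reloaded
print(registry_demo.FactoryClass.registry is registry_before)  # True: same dict, mutated
print(sorted(registry_before))                                 # ['First', 'Second']
print(registry_before["First"] is first_before)                # False: First was redefined
print(hasattr(registry_demo, "Second"))                        # False: __init__.py did not re-run
```

The last line is worth noting: the registry knows about Second, but the package namespace does not get a Second attribute until the package itself is reloaded.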



Why are now modules.FactoryClass and modules.base.FactoryClass different objects?

Because you reloaded modules.base, which created a new FactoryClass class object, but neither the from .base import * in __init__.py nor the from .base import BaseClass, FactoryClass in implementation.py was re-executed, so modules.FactoryClass still refers to the class object from before the reload.
By reloading, you got yourself a fresh copy of everything defined in base. The problem is that you still have lingering references to the versions from before the reload.
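
The same experiment as a self-contained sketch (the package name factory_demo and the temporary directory are made up; the files are shortened versions of yours):

```python
import importlib
import sys
import tempfile
from pathlib import Path

sys.dont_write_bytecode = True  # no .pyc files, so reload always recompiles the source

# Throwaway package `factory_demo` (hypothetical name), shortened from the question.
root = Path(tempfile.mkdtemp())
pkg = root / "factory_demo"
pkg.mkdir()
(pkg / "__init__.py").write_text("from .base import *\nfrom .implementation import *\n")
(pkg / "base.py").write_text(
    "class FactoryClass:\n"
    "    registry = {}\n"
    "    @classmethod\n"
    "    def add_to_registry(cls, newclass):\n"
    "        cls.registry[newclass.__name__] = newclass\n"
    "class BaseClass:\n"
    "    pass\n"
)
(pkg / "implementation.py").write_text(
    "from .base import BaseClass, FactoryClass\n"
    "class First(BaseClass):\n"
    "    pass\n"
    "FactoryClass.add_to_registry(First)\n"
)
sys.path.insert(0, str(root))
importlib.invalidate_caches()
import factory_demo

old_factory = factory_demo.FactoryClass
importlib.reload(factory_demo.base)  # executes base.py again: brand-new class objects
new_factory = factory_demo.base.FactoryClass

print(new_factory is old_factory)                               # False
print(factory_demo.FactoryClass is old_factory)                 # True: package holds the old class
print(factory_demo.implementation.FactoryClass is old_factory)  # True: so does implementation
print(sorted(old_factory.registry))                             # ['First']
print(sorted(new_factory.registry))                             # []
```
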



I hope this answers your questions.
Importing is not easy, but reloading is notably tricky.

If you truly want to reload your code, you will have to take extra care to re-import everything correctly and in the right order (if such an order exists, and if there are not too many side effects). You need a proper script to update everything correctly, and whenever it does not work perfectly, you will get horrible, sad and mind-bending bugs.
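
For a package as small as yours, such a script could look like the sketch below. reload_in_order is a hypothetical helper, not something importlib provides, and the caller is assumed to know the right order (dependencies first). The package name ordered_demo, the temporary directory and the simplified files are made up for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path
from types import ModuleType
from typing import Iterable

sys.dont_write_bytecode = True  # no .pyc files, so reload always recompiles the source


def reload_in_order(package: ModuleType, submodule_names: Iterable[str]) -> ModuleType:
    """Reload the listed submodules (dependencies first), then the package itself.

    Hypothetical helper: it trusts the caller to provide a correct order.
    """
    for name in submodule_names:
        importlib.reload(sys.modules[f"{package.__name__}.{name}"])
    return importlib.reload(package)


# Throwaway package `ordered_demo` (hypothetical name), simplified from the question.
root = Path(tempfile.mkdtemp())
pkg = root / "ordered_demo"
pkg.mkdir()
(pkg / "__init__.py").write_text("from .base import *\nfrom .implementation import *\n")
(pkg / "base.py").write_text(
    "class FactoryClass:\n    registry = {}\nclass BaseClass:\n    pass\n"
)
impl_source = (
    "from .base import BaseClass, FactoryClass\n"
    "class First(BaseClass):\n    pass\n"
    "FactoryClass.registry['First'] = First\n"
)
(pkg / "implementation.py").write_text(impl_source)
sys.path.insert(0, str(root))
importlib.invalidate_caches()
import ordered_demo

# Simulate finishing `Second`, then reload base -> implementation -> package.
(pkg / "implementation.py").write_text(
    impl_source + "class Second(BaseClass):\n    pass\nFactoryClass.registry['Second'] = Second\n"
)
reload_in_order(ordered_demo, ["base", "implementation"])

print(ordered_demo.FactoryClass is ordered_demo.base.FactoryClass)  # True
print(sorted(ordered_demo.FactoryClass.registry))                   # ['First', 'Second']
print(ordered_demo.Second is ordered_demo.implementation.Second)    # True
```

Even with the right order, objects created before the reload remain instances of the old classes, so this only helps if you also recreate them.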

But if you prefer to keep your sanity, and your workflow does not require you to reload parts of the program, just close it and restart it. There is nothing wrong with that. It is the simple solution. The sane solution.

TL;DR: don't use importlib.reload unless you know exactly how it works and understand the risks.

Lenormju