I'm writing a data manipulation package based on Python pandas. For the parts that have a functional style, I would like to make my package hierarchy flatter. Currently, functions need to be imported with calls such as:

from package.module.submodule import my_function

The proposed change would make it possible to import

from package import my_function

To achieve this, functions and other objects would be imported into package/__init__.py so that they are available in the top-level namespace. This is how pandas does it: for example, pandas/__init__.py makes it possible to import

from pandas import DataFrame

when in fact the DataFrame class is defined inside pandas.core.frame. You would normally have to import it like this: from pandas.core.frame import DataFrame, but since it's imported in the top-level __init__.py it's made available at the top level.
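For illustration, the top-level __init__.py of my package would then contain a re-export along these lines (a minimal sketch using the names above):

# package/__init__.py
from package.module.submodule import my_function

# optionally declare the re-exported public API explicitly
__all__ = ["my_function"]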

Making functions available as top level imports:

  • would expose a flat hierarchy for users and would make it easier to use the package

  • but internally (in the package code) we should not import from package/__init__.py directly to avoid creating circular references.

    • Searching the pandas code base for from pandas import, it seems that pandas itself always avoids importing from the top level (except test scripts, which do use from pandas import DataFrame). I don't know how to enforce this.
    • Maybe this tool can be helpful: pylint-forbidden-imports,
    • or rather flake8-tidy-imports, since we are using black and flake8 as a pre-commit hook. flake8-tidy-imports makes it possible to define which imports are forbidden. It seems it applies to the whole package, though, and not to a specific location in the package (see the configuration sketch after this list).
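For reference, here is a sketch of how flake8-tidy-imports could express this rule, assuming its banned-modules option. As noted above, the ban applies project-wide, so tests that intentionally use from package import ... would need a # noqa comment or a per-file-ignores entry:

# setup.cfg or .flake8 (sketch)
[flake8]
banned-modules =
    package = Import from the defining submodule (e.g. package.module.submodule) instead of the top level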

Comments

  • 2c: this seems like a bad idea but also asking for tool suggestions on SO is off topic as it leads to spammy/advertisery answers – anthony sottile Mar 27 '22 at 12:52
  • @AnthonySottile I am asking for advice on how to expose functions as top level imports. I added the link to 2 related questions which seem to show that this is not necessarily a bad idea. – Paul Rougieux Mar 28 '22 at 08:40
  • your question is "how can I prevent", tagged [flake8], and listed some tools -- how is this not a "please suggest to me a tool" question? – anthony sottile Mar 28 '22 at 13:22
  • At this stage, I am not even sure I am asking the right question. I would like to know if this is a good approach. I think it is because other packages such as pandas are doing it. Then if I should use the tool or not and how to use the tool. – Paul Rougieux Mar 29 '22 at 08:26
  • well then that would be "opinion based" and also closed :) – anthony sottile Mar 29 '22 at 13:23
  • @AnthonySottile I'm here to learn. Could you please point out a place that explains why this is a bad idea? – Paul Rougieux Mar 31 '22 at 11:49
  • it's a whole load of complexity and makes it way too easy to introduce cycles (even if you don't "import from `__init__`" it's going to be implicitly imported any time you deal with any module in your hierarchy). All to save a few characters of keyboard-typing? – anthony sottile Mar 31 '22 at 13:38
  • It seems like the functionality you want is served by the post in your [2nd related question](https://stackoverflow.com/questions/44834/can-someone-explain-all-in-python/35710527#35710527). Do you mean you want a way to ensure developers working on your package don't break the "top level import" rule? – tdpu Apr 02 '22 at 02:20

3 Answers

I think the concern you are expressing is the fact that "importing a sub-module" and "importing a sub-module during the import of a module" are not the same thing. For example, writing this in ipython:

from module.sub.file import func

and writing this from within the module package

from module.sub.file import func

do not do the same thing (even though they look the same). This is because, if module has already started its initialization, subsequent imports of a sub-module will not re-initialize module, nor does module need to have finished initializing before its sub-modules are imported. This is very similar to how class inheritance works too.

This means that it is perfectly valid for a package to pull various functions from all of its sub-modules, while each of its sub-modules can explicitly import from the others through the package itself without causing an infinite loop. This is by design. For example:

module
    __init__.py
        from .sub1.file1 import func1
        from .sub2.file2 import func2
    sub1
        __init__.py
        file1.py
            from module.sub2.file2 import func2
            def func1(x):
                return func2(x)+x
    sub2
        __init__.py
        file2.py
            def func2(x):
                return x+1

Here the sub-module sub1 is dependent on sub2. The line from module.sub2.file2 import func2 normally means

  1. execute module/__init__.py and load sub2 from its namespace
  2. execute module/sub2/__init__.py and load file2 from its namespace
  3. execute module/sub2/file2.py and load func2 from its namespace

but during a call of from module import func1, when we reach the line from module.sub2.file2 import func2 in file1.py, we have either already run or are in the middle of running module/__init__.py and module/sub1/__init__.py. This means that line effectively does:

  1. module/__init__.py is currently being executed...skip
  2. module/sub2/__init__.py is imported if it has not been already (here it is empty, so nothing happens)
  3. execute module/sub2/file2.py and load func2 from its namespace
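The mechanism behind step 1 is Python's module cache: a module object is registered in sys.modules as soon as its import starts, before its __init__.py has finished running, and any later import of the same name just returns that cached (possibly half-initialized) object. A quick way to see the cache in action, assuming the example layout above is importable:

import sys

import module                    # runs module/__init__.py exactly once
print("module" in sys.modules)   # True: the module object is cached
import module                    # no re-execution, just a sys.modules lookup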

In general, if module/__init__.py is currently being executed, then any further import of module will simply be skipped. You can quite literally import module itself and the import will be skipped outright even though module hasn't finished loading itself. Add some print statements:

module
    __init__.py
        print('start init module')
        from .sub1.file1 import func1
        from .sub2.file2 import func2
        print('end init module')
    sub1
        __init__.py
        file1.py
            print('loading module from file1.py')
            import module
            print('done loading module from file1.py')
            from module.sub2.file2 import func2
            def func1(x):
                return func2(x)+x
    sub2
        __init__.py
        file2.py
            def func2(x):
                return x+1

Now run from module import func1:

start init module
loading module from file1.py
# Notice that nothing is printed here, meaning module/__init__.py was not run again,
# even though we explicitly wrote "import module"; in fact, "module" hadn't
# even finished executing its own __init__.py file.
done loading module from file1.py
end init module

This is awesome from a design perspective. It means that sub1 could very well have been a package completely separate from module but dependent on module. Then at some point, someone said, "let's drop that independent package in as a sub-module of our module package". The entire folder is just dropped in (without changing any code), and module can then import from it like a local sub-package without any risk of accidentally creating an import loop, even though the sub-package depends on other parts of module itself.

Bobby Ocean

Your problem is exactly stated in the documentation and solved by using intra-package referencing. You refer to the sub-modules using

from ..frame import DataFrame

instead of using

from pandas.core.frame import DataFrame

I can see it also worked for people here.

This type of referencing is commonly used; for example, the Baidu team uses it in their OCR engine to import all modules.

Stick to your idea, because having to refer to full submodule paths is harsh if your users are beginners.
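As a rough sketch of how the relative form resolves (the real pandas layout may differ; the paths here are only illustrative):

pandas
    core
        frame.py            # defines DataFrame
        reshape
            concat.py       # "from ..frame import DataFrame" here resolves to
                            # pandas.core.frame, because ".." is the parent package
                            # of pandas.core.reshape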

Esraa Abdelmaksoud

I've searched through the pandas GitHub repository and was unable to find a pre-commit hook that addresses your specific problem, so I adapted the use-pd_array-in-core hook from the pandas repository.

My test setup has the following folder structure (excluding the .git folder):

├── package
│   ├── __init__.py
│   ├── module
│   │   └── submodule.py
│   └── second_module
│       └── submodule.py
├── .pre-commit-config.yaml
└── scripts
   └── import_from_submodules.py

The .pre-commit-config.yaml contains

repos:
-   repo: local
    hooks:
    -   id: import-from-submodules
        name: Import from appropriate submodules
        language: python
        entry: python scripts/import_from_submodules.py package
        files: ^package/
        types: [python]

and the import_from_submodules.py file contains

"""
Check that all imports reference the correct submodule and not import directly
from __init__.py, even though that is technically possible.

This is meant to be run as a pre-commit hook - to run it manually, you can do:

    pre-commit run import-from-submodules --all-files

"""

from __future__ import annotations

import argparse
import ast
import sys
from typing import Sequence


class Visitor(ast.NodeVisitor):
    def __init__(self, package_name: str, path: str) -> None:
        self.package_name = package_name
        self.path = path
        self.error_message = (
            "{path}:{lineno}:{col_offset}: "
            f"Don't import from {self.package_name}, "
            f"import from {self.package_name}.submodule instead\n"
        )

    def visit_Import(self, node: ast.Import) -> None:
        if any(module.name == self.package_name for module in node.names):
            msg = self.error_message.format(
                path=self.path, lineno=node.lineno, col_offset=node.col_offset
            )
            sys.stdout.write(msg)
            sys.exit(1)
        super().generic_visit(node)

    def visit_ImportFrom(self, node: ast.ImportFrom) -> None:
        if node.module == self.package_name:
            msg = self.error_message.format(
                path=self.path, lineno=node.lineno, col_offset=node.col_offset
            )
            sys.stdout.write(msg)
            sys.exit(1)
        super().generic_visit(node)


def import_from_submodules(package_name: str, content: str, path: str) -> None:
    tree = ast.parse(content)
    visitor = Visitor(package_name, path)
    visitor.visit(tree)


def main(argv: Sequence[str] | None = None) -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("package_name")
    parser.add_argument("paths", nargs="*")
    args = parser.parse_args(argv)

    for path in args.paths:
        with open(path, encoding="utf-8") as fd:
            content = fd.read()
        import_from_submodules(args.package_name, content, path)


if __name__ == "__main__":
    main()

This uses the ast module to parse the Python source code of every Python file in the package directory and to visit each import <module> and from <module> import <function> statement. If the <module> part equals the package name (which is a command-line parameter of the script that you set in the pre-commit config), the position of the offending line is printed and the script exits with a nonzero exit code to indicate that there are errors.
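Since the script takes the package name followed by the files to check, you can also run it directly without pre-commit, for example:

python scripts/import_from_submodules.py package package/second_module/submodule.py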

Let's say there is a function fun inside package/module/submodule.py, which is also imported in __init__.py and included in __all__. Inside package/second_module/submodule.py the following lines would raise an error if you run pre-commit run import-from-submodules --all-files:

import package
from package import fun
from ..package import fun

whereas

from package.module.submodule import fun
from ..module.submodule import fun

do not. Note that the relative import examples illustrate that all leading dots of relative imports are ignored when comparing the module name to the given package name.

I hope this covers your use case. You are of course welcome to change the error message to something more helpful/clear. The ast module is extremely powerful if you want to extend this code. For example, the use-pd_array-in-core pre-commit hook mentioned earlier also flags all pd.array expressions by checking every attribute access.
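As an illustration of that idea (a sketch only, not the actual pandas hook), a hypothetical visit_Attribute method added to the Visitor above could flag expressions such as package.fun in the same way as the import checks:

class Visitor(ast.NodeVisitor):
    # ... __init__, visit_Import and visit_ImportFrom as above ...

    def visit_Attribute(self, node: ast.Attribute) -> None:
        # Flag attribute access on the bare package name, e.g. "package.fun",
        # which only works because the name was re-exported in __init__.py.
        if isinstance(node.value, ast.Name) and node.value.id == self.package_name:
            msg = self.error_message.format(
                path=self.path, lineno=node.lineno, col_offset=node.col_offset
            )
            sys.stdout.write(msg)
            sys.exit(1)
        super().generic_visit(node)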

BurningKarl