I have a package laid out like this:

mypkg
    |-mypkg
        |- data
            |- data.csv
            |- __init__.py  # Required for importlib.resources 
        |- scripts
            |- module.py
        |- __init__.py

The module module.py requires data.csv to perform a certain task.

My first naive approach to accessing data.csv was:

# module.py - Approach 1
from pathlib import Path

data_path = Path(Path.cwd().parent, 'data', 'data.csv')

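But this breaks as soon as module.py is imported from elsewhere, e.g. via from mypkg.scripts import module, because Path.cwd() is the directory the interpreter was started in, not the directory containing module.py. I need a way to access data.csv regardless of where mypkg is imported from.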
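To make the failure mode of Approach 1 concrete, here is a minimal sketch. The helper names (cwd_relative_data_path, demo_cwd_dependence) and the temporary directories are made up for illustration; the point is only that the same expression produces different paths depending on the working directory.

```python
import os
import tempfile
from pathlib import Path


def cwd_relative_data_path():
    # Same expression as Approach 1: resolved against the current working directory.
    return Path(Path.cwd().parent, "data", "data.csv")


def demo_cwd_dependence():
    """Evaluate the Approach 1 path from two different working directories."""
    original = Path.cwd()
    with tempfile.TemporaryDirectory() as first_root, tempfile.TemporaryDirectory() as second_root:
        first_dir = Path(first_root, "scripts")
        second_dir = Path(second_root, "elsewhere")
        first_dir.mkdir()
        second_dir.mkdir()
        try:
            os.chdir(first_dir)
            first = cwd_relative_data_path()
            os.chdir(second_dir)
            second = cwd_relative_data_path()
        finally:
            # Always restore the working directory before the temp dirs are removed.
            os.chdir(original)
    return first, second


if __name__ == "__main__":
    for path in demo_cwd_dependence():
        print(path)
```

The two printed paths point at different directories, even though the code computing them did not change.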

The next naive approach is to use the __file__ attribute to get the path of module.py, wherever it happens to be located.

# module.py - Approach 2
from pathlib import Path

data_path = Path(Path(__file__).resolve().parents[1], 'data', 'data.csv')

However, while researching this problem I found that this approach is discouraged. See, for example, How to read a (static) file from inside a Python package?.

Though there doesn't seem to be total agreement as to the best solution to this problem, it looks like importlib.resources is maybe the most popular. I believe this would look like:

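# module.py - Approach 3
import importlib.resources

# importlib.resources.path() returns a context manager that yields a pathlib.Path.
data_path_resource = importlib.resources.path('mypkg.data', 'data.csv')
with data_path_resource as resource:
    # The path is only guaranteed to exist inside this with block.
    data_path = resource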

Why is this final approach better than __file__? It seems that __file__ won't work if the source code is zipped. I'm not familiar with that case, and it sounds a bit fringe. I don't think my code will ever be run zipped.
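For readers who have not seen the zipped case, the following is a rough sketch of what it looks like. It builds a throwaway package inside a zip archive, puts the archive on sys.path, and then tries both strategies. The package name demo_zipped_pkg, the archive name bundle.zip, and the function names are all invented for the demonstration; importlib.resources.files() requires Python 3.9 or later.

```python
import importlib
import importlib.resources
import sys
import tempfile
import zipfile
from pathlib import Path


def build_zipped_package(zip_path):
    """Write a throwaway package (hypothetical name: demo_zipped_pkg) into a zip."""
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("demo_zipped_pkg/__init__.py", "")
        zf.writestr("demo_zipped_pkg/data/__init__.py", "")
        zf.writestr("demo_zipped_pkg/data/data.csv", "Col 1,Col 2\nv1,v2\n")


def compare_approaches():
    """Return (does the __file__-based path exist?, text read via importlib.resources)."""
    with tempfile.TemporaryDirectory() as tmp:
        zip_path = Path(tmp, "bundle.zip")
        build_zipped_package(zip_path)
        sys.path.insert(0, str(zip_path))
        try:
            importlib.invalidate_caches()
            pkg = importlib.import_module("demo_zipped_pkg.data")
            # __file__ strategy: the resulting path points *inside* the archive,
            # so there is no such file on the real filesystem.
            file_based = Path(pkg.__file__).parent / "data.csv"
            file_based_exists = file_based.exists()
            # importlib.resources strategy: the package's loader supplies the content.
            resource = importlib.resources.files("demo_zipped_pkg.data").joinpath("data.csv")
            text = resource.read_text(encoding="utf-8")
        finally:
            # Undo the global changes made for the demonstration.
            sys.path.remove(str(zip_path))
            sys.path_importer_cache.pop(str(zip_path), None)
            for name in ("demo_zipped_pkg.data", "demo_zipped_pkg"):
                sys.modules.pop(name, None)
    return file_based_exists, text


if __name__ == "__main__":
    exists, text = compare_approaches()
    print("__file__-based path exists:", exists)
    print(text)
```

Under these assumptions the path built from __file__ does not exist on disk, while the importlib.resources read still returns the CSV text. Whether this matters in practice depends on whether the package is ever imported from an archive.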

The added overhead from importlib seems a little ridiculous. I need to add an empty __init__.py in the data folder, I need to import importlib, and I need to use a context manager just to access a relative path.
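As a point of comparison on the overhead, here is a minimal sketch using the files() and as_file() functions that importlib.resources provides from Python 3.9 onward. It assumes a throwaway on-disk package named demo_res_pkg, created only for this demonstration; the function names are likewise hypothetical. Reading the content needs no context manager, and as_file() is only needed when some other library insists on a real filesystem path.

```python
import csv
import importlib
import importlib.resources
import sys
import tempfile
from pathlib import Path


def make_demo_package(root):
    """Create a throwaway on-disk package (hypothetical name: demo_res_pkg)."""
    data_dir = Path(root, "demo_res_pkg", "data")
    data_dir.mkdir(parents=True)
    Path(root, "demo_res_pkg", "__init__.py").write_text("", encoding="utf-8")
    (data_dir / "__init__.py").write_text("", encoding="utf-8")
    (data_dir / "data.csv").write_text("Col 1,Col 2\nv1,v2\n", encoding="utf-8")


def read_rows():
    """Return (raw text, parsed CSV rows) for the packaged data.csv."""
    with tempfile.TemporaryDirectory() as tmp:
        make_demo_package(tmp)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            resource = importlib.resources.files("demo_res_pkg.data").joinpath("data.csv")
            # When the content is enough, no context manager is required.
            text = resource.read_text(encoding="utf-8")
            # as_file() yields a real path for APIs that need one (csv, h5 readers, ...).
            with importlib.resources.as_file(resource) as path:
                with open(path, newline="", encoding="utf-8") as handle:
                    rows = list(csv.reader(handle))
        finally:
            # Undo the global changes made for the demonstration.
            sys.path.remove(tmp)
            for name in ("demo_res_pkg.data", "demo_res_pkg"):
                sys.modules.pop(name, None)
    return text, rows


if __name__ == "__main__":
    text, rows = read_rows()
    for row in rows:
        print(row)
```

For a package installed as ordinary files, as_file() hands back the existing path; for a zipped package it may extract the resource to a temporary file for the duration of the with block.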

What am I missing about the benefits of the importlib strategy? Why not just use __file__?

Edit: One possible justification for the importlib approach is that it has slightly better semantics. That is, data.csv should be thought of as part of the package, so we should access it using something like from mypkg import data.csv, but of course this syntax only works for importing .py Python modules. importlib.resources more or less ports the "import something from some package" semantics to more general file types.

By contrast, the syntax of building a relative path from __file__ is sort of saying: this module is incidentally close to the data file in the file structure so let's take advantage of that to access it. The fact that the data file is part of the package isn't leveraged.

Jagerber48
  • Did you read [wim's answer](https://stackoverflow.com/a/58941536/8601760)? It's the top answer sorted by "Trending (recent votes count more)". It discusses why not to use either of those you mentioned. It recommends `pkgutil`, and `importlib_resources` for Python 3.9+, instead. – aaron Sep 29 '22 at 15:03
  • @aaron I want to understand the top answers in the linked question better. (1) what are more details about the zip/wheel thing? When might that use case occur and what does it look like in detail? (2) in the approaches in the linked answer I want to know how I can get a path to resources so I can open *whatever* type of binary file I have using whatever helper module (csv, h5 etc.), not just opening as a binary. – Jagerber48 Apr 23 '23 at 13:27

1 Answer

You should be able to use something like this with __file__:

import csv
from io import StringIO
from pathlib import Path
import pkgutil
import sys


def main():
    # Point to appropriate ancestor directory
    p = Path(__file__).parent.parent.parent
    sys.path.insert(0, str(p))
    data = pkgutil.get_data('mypkg.data', 'data.csv')
    reader = csv.reader(StringIO(data.decode()))
    for row in reader:
        print(row)


if __name__ == '__main__':
    main()

If the file data.csv contains

Col 1,Col 2
v1,v2

then the above script will print

['Col 1', 'Col 2']
['v1', 'v2']

You can see the whole thing running here if you select the "Shell" tab and run python mypkg/scripts/module.py.

Vinay Sajip