109

I use setuptools to distribute my Python package. Now I need to distribute additional data files.

From what I've gathered from the setuptools documentation, I need to keep my data files inside the package directory. However, I would rather keep them in a subdirectory of the project root.

What I would like to avoid:

/ #root
|- src/
|  |- mypackage/
|  |  |- data/
|  |  |  |- resource1
|  |  |  |- [...]
|  |  |- __init__.py
|  |  |- [...]
|- setup.py

What I would like to have instead:

/ #root
|- data/
|  |- resource1
|  |- [...]
|- src/
|  |- mypackage/
|  |  |- __init__.py
|  |  |- [...]
|- setup.py

I'm not comfortable with having so many subdirectories if they aren't essential, and I find deeply nested directories cumbersome to work with. I can't find a reason why I /have/ to put the files inside the package directory. Is there a good reason that justifies this restriction?

Lolindrath
phant0m
  • 11
    I asked a similar question about using 'data_files' to distribute resources (docs, images, etc): http://stackoverflow.com/questions/5192386/installing-my-sdist-from-pypi-puts-the-files-in-unexpected-places ...and the (two) responses both said to use 'package_data' instead. Now I'm using package data, but that implies I have to put my data and docs inside my package, i.e. mixed in amongst my source code. I dislike this. When grepping my source, I find not just the class definition that I am searching for, but also the dozens of mentions they get within my RST, HTML and intermediate files. :-( – Jonathan Hartley Mar 25 '11 at 09:33
  • 2
    I know this response is very late, @JonathanHartley , but you can make any directory a "package" by adding an `__init__.py` file, even if that file is blank. So you could keep a data directory separate with an empty `__init__.py` file to make it look like a package. That should keep grep from within your source tree from picking them up but it will still be recognized as a package by python and its build tools. – dhj Sep 04 '14 at 03:32
  • 5
    @dhj the only problem with that approach is python thinks you've installed a package called 'data'. If another package you installed tried to package data in the same way, you would have two conflicting 'data' packages installed. – toes Nov 23 '16 at 15:47

4 Answers

124

Option 1: Install as package data

The main advantage of placing data files inside the root of your Python package is that you don't have to worry about where the files will live on a user's system, which may be Windows, Mac, Linux, some mobile platform, or inside an Egg. You can always locate the data directory relative to your Python package root, no matter where or how it is installed.

For example, if I have a project layout like so:

project/
    foo/
        __init__.py
        data/
            resource1/
                foo.txt

You can add a function to __init__.py to locate an absolute path to a data file:

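import os

# Absolute path of the directory containing this __init__.py
_ROOT = os.path.abspath(os.path.dirname(__file__))

def get_data(path):
    """Return the absolute path to a file under the package's data/ directory."""
    return os.path.join(_ROOT, 'data', path)

print(get_data('resource1/foo.txt'))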

Outputs:

/Users/pat/project/foo/data/resource1/foo.txt

After the project is installed as an Egg the path to data will change, but the code doesn't need to change:

/Users/pat/virtenv/foo/lib/python2.6/site-packages/foo-0.0.0-py2.6.egg/foo/data/resource1/foo.txt

Option 2: Install to fixed location

The alternative would be to place your data outside the Python package and then either:

  1. Have the location of the data passed in via a configuration file or command-line arguments, or
  2. Embed the location into your Python code.

This is far less desirable if you plan to distribute your project. If you really want to do this, you can install your data wherever you like on the target system by specifying the destination for each group of files by passing in a list of tuples:

from setuptools import setup

setup(
    # ... other setup() arguments ...
    data_files=[
        # (destination directory on the target system, [source files])
        ('/var/data1', ['data/foo.txt']),
        ('/var/data2', ['data/bar.txt']),
    ],
)
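
For the first variant of Option 2 (the location is passed in at run time), the lookup could be sketched roughly as follows. This is only an illustration: the `--data-dir` flag, the `MYPACKAGE_DATA_DIR` environment variable, and the `resolve_data_dir` helper are hypothetical names, and `/var/data1` is simply reused from the example above as a fallback.

```python
import argparse
import os
from pathlib import Path

# Hypothetical fallback, reusing the install destination from the example above.
DEFAULT_DATA_DIR = Path("/var/data1")


def resolve_data_dir(argv=None, environ=None):
    """Pick the data directory: command-line flag first, then env var, then default."""
    environ = os.environ if environ is None else environ

    parser = argparse.ArgumentParser()
    # --data-dir is an assumed flag name, not something setuptools defines.
    parser.add_argument("--data-dir", type=Path, default=None)
    # parse_known_args ignores unrelated arguments instead of exiting.
    args, _unknown = parser.parse_known_args(argv)

    if args.data_dir is not None:
        return args.data_dir
    if "MYPACKAGE_DATA_DIR" in environ:  # assumed variable name
        return Path(environ["MYPACKAGE_DATA_DIR"])
    return DEFAULT_DATA_DIR
```

Calling `resolve_data_dir()` with no arguments reads `sys.argv` and `os.environ`; passing explicit values, as in `resolve_data_dir(["--data-dir", "/tmp/x"], {})`, makes the lookup easy to test.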

Updated: Example of a shell function to recursively grep Python files:

atlas% function grep_py { find . -name '*.py' -exec grep -Hn $* {} \; }
atlas% grep_py ": \["
./setup.py:9:    package_data={'foo': ['data/resource1/foo.txt']}
samplebias
  • 10
    Thanks very much for helping me come to terms with the situation. So I'm happy to run with using package_data as you (and everyone else) suggests. However: Is it only me who finds putting their data & docs inside their package source directory to be inconveniently messy? (e.g. grepping my source returns dozens of unwanted hits from my documentation. I could add '--exclude-dir' params to grep every time I ever use it, which would differ from one project to the next, but that seems icky) Is it possible to do something like include a 'src' subdir inside my package dir without breaking imports, etc.? – Jonathan Hartley Mar 25 '11 at 09:43
  • I usually only put data files that the package requires under the package dir. I would install the docs as `data_files`. Also, you could come up with a shell alias for grep to ignore non-Python files, something like `grep_py`. – samplebias Mar 25 '11 at 16:04
  • Hey samplebias. Thanks for the updates. It's not just grep though, it's *everything*, from text editor search-in-files to ctags to awk. I'm going to try reorging my project to put docs in data_files as you suggest, see how that works out. Back soon... :-) – Jonathan Hartley Mar 29 '11 at 22:19
  • ...that seems to work out OK. Thanks for setting me on the right track. Are the +50 reputation points tasty? – Jonathan Hartley Mar 30 '11 at 13:07
  • Thanks! Great to hear, glad it worked out and you're making progress! – samplebias Mar 30 '11 at 17:06
  • I currently have the same problem. But I realised, that using `data_files` makes `easy_install` unhappy. It raises a `SandboxException` claiming the package cannot be safely installed if I do that. I was planning to install docs to `/usr/doc` and some sample scripts to `/usr/share`, which both are outside the sandbox :( – exhuma Dec 12 '11 at 08:10
  • what would happen if you used `os.path.join(_ROOT, 'data', path)` in `data_files`? – Jonathan Jan 21 '15 at 16:00
  • ``package_data`` is not only ugly. It also violates the FHS standard: http://www.pathname.com/fhs/pub/fhs-2.3.html#USRLIBLIBRARIESFORPROGRAMMINGANDPA – Toon Verstraelen May 18 '15 at 23:35
  • `data_files` destroys directory structure--every leaf file gets dropped into the .egg directory on install. `package_data` apparently only works for binary distributions (not source). How hard is it to copy a directory? – weberc2 Jun 13 '15 at 21:26
  • Since some time has passed: Is this still the preferred/best way to handle data files? – Daniel Mar 15 '18 at 10:38
  • Option 2 was what I was looking for. Thanks a lot. – Umair Aslam May 15 '18 at 16:26
  • Can someone explain how you use Option 1's `get_data` function? how should I import `__init__.py` ? – Tomas G. Mar 12 '20 at 08:28
21

I think I found a good compromise that lets you maintain the following structure:

/ #root
|- data/
|  |- resource1
|  |- [...]
|- src/
|  |- mypackage/
|  |  |- __init__.py
|  |  |- [...]
|- setup.py

You should install the data as package_data, to avoid the problems described in samplebias's answer, but in order to maintain the file structure you should add the following to your setup.py:

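import os
from setuptools import setup

# Temporarily link data/ into the package so package_data can find it
os.symlink('../../data', 'src/mypackage/data')
try:
    setup(
        # ... other setup() arguments ...
        package_data={'mypackage': ['data/*']},
    )
finally:
    # Remove the link again so the source tree stays clean
    os.unlink('src/mypackage/data')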

This way we create the appropriate structure "just in time" and keep our source tree organized.
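
The create-then-remove step can also be wrapped in a small context manager, so the link is only removed if it was actually created. The following is a minimal sketch, not part of setuptools: `temporary_symlink` is a hypothetical helper, and the demonstration builds a throwaway directory tree that mirrors the layout above rather than calling `setup()`.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def temporary_symlink(target, link_path):
    """Create link_path -> target and remove it again on exit (hypothetical helper)."""
    os.symlink(target, link_path)
    try:
        yield link_path
    finally:
        # Only reached if the symlink was created, so unlink has something to remove.
        os.unlink(link_path)


# Throwaway tree mirroring the layout above: data/ next to src/mypackage/.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "data"))
os.makedirs(os.path.join(root, "src", "mypackage"))
open(os.path.join(root, "data", "resource1"), "w").close()

link = os.path.join(root, "src", "mypackage", "data")
# A relative target is resolved relative to the directory containing the link.
with temporary_symlink(os.path.join("..", "..", "data"), link):
    # In a real setup.py, the setup(...) call would go here.
    seen_during = sorted(os.listdir(link))
exists_after = os.path.lexists(link)

print(seen_during, exists_after)
```

Note that creating symlinks may require extra privileges on Windows, so this approach is mostly suited to Unix-like build environments.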

To access such data files within your code, you 'simply' use:

from pkg_resources import Requirement, resource_filename
data = resource_filename(Requirement.parse("main_package"), 'mypackage/data')

I still don't like having to specify 'mypackage' in the code, as the data may not necessarily have anything to do with this module, but I guess it's a good compromise.

alexsmail
polvoazul
  • This doesn't work with source distribution (e.g., `python setup.py sdist`) because in the source distribution tarball, `data/` is already in `mypackage/data` and would error during `pip install mypackage.tar.gz`. – Keto Jan 24 '22 at 20:59
0

You could use importlib_resources or importlib.resources (depending on your Python version).

https://importlib-resources.readthedocs.io/en/latest/using.html
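
As a rough illustration of what that looks like, the sketch below reads a data file through `importlib.resources.files()`, which is available in the standard library from Python 3.9 (the `importlib_resources` backport offers the same function for older versions). So that it runs on its own, it first builds a throwaway package on disk; the names `demo_pkg_with_data`, `resource1.txt`, and `read_data` are made up for the example.

```python
import importlib
import sys
import tempfile
from importlib import resources  # resources.files() needs Python 3.9+
from pathlib import Path

# Build a throwaway package so the sketch is self-contained.
# "demo_pkg_with_data" and "resource1.txt" are hypothetical names.
tmp_root = Path(tempfile.mkdtemp())
pkg_dir = tmp_root / "demo_pkg_with_data"
(pkg_dir / "data").mkdir(parents=True)
(pkg_dir / "__init__.py").write_text("")
(pkg_dir / "data" / "resource1.txt").write_text("hello from package data\n")

# Make the temporary package importable.
sys.path.insert(0, str(tmp_root))
importlib.invalidate_caches()


def read_data(name):
    """Return the text of a file in the package's data/ directory."""
    resource = resources.files("demo_pkg_with_data") / "data" / name
    return resource.read_text(encoding="utf-8")


content = read_data("resource1.txt")
print(content)
```

In a real project the package would already be installed, so only the `read_data` part would be needed, with the data files declared through package_data.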

AbbasTari
-4

I think that you can pass basically anything as the *data_files* argument to setup().

lgautier
  • Hmm... I can see that it is in the distutils documentation, can't see it in the setuptools documentation though. Anyway, how would I be able to access it eventually? – phant0m Dec 23 '10 at 17:05
  • I think data_files should only be used for data which is shared between several packages. for example, if you pip install from PyPI, then files listed in data_files are installed to directories directly under your main Python install dir. (ie. not in Python27/Lib/site-packages/mypackage, but in parallel with 'Python27/Lib') – Jonathan Hartley Mar 24 '11 at 00:42