Access a file from a python egg

Question

Hi, I am working with Python packaging. I have 3 non-code files, namely ['synonyms.csv', 'acronyms.csv', 'words.txt'].

These files exist in a folder structure Wordproject/WordProject/Repository/DataBank/
I have a RepositoryReader class at the path Wordproject/WordProject/Repository/
I've written code that finds the current location of the RepositoryReader, looks for a subdirectory called DataBank, and looks for the 3 files there.

The problem is that when I create an egg out of the code and then run it, my code gives me the error:

Could not find the file at X:\1. Projects\Python\Wordproject\venv\lib\site-packages\Wordproject-1.0-py3.6.egg\Wordproject\Repository\DataBank\synonyms.csv

The code cannot fetch or read the file when the path points inside an egg. Is there any way around this? These files have to be in the egg.

Is your goal to have these files installed somewhere accessible at `pip install` time, or to have them embedded in the package directory and access them the same way you can access submodules? — abarnert, Apr 11 '18 at 18:55
@abarnert actually I can't push this code to `PyPI` since it's an organizational thing. But I am more interested in `obfuscation` of the code such that even with the access no one can actually decompile the package. And hence I want the files to be embedded inside the package itself. — iam.Carrot, Apr 11 '18 at 19:04
Well, you're not going to get much obfuscation out of an egg file. It's basically just a zipfile plus a manifest telling you where all the interesting files are, which might slow down a novice hacker for about 60 seconds… — abarnert, Apr 11 '18 at 19:10
@abarnert anything that you would recommend for this kinda requirement? — iam.Carrot, Apr 11 '18 at 19:11
Depends on why you're trying to obfuscate things, but the usual best answer is: don't try; almost anything you come up with will cost more than it's worth and will only give you a false sense of security that prevents you from a better solution. There are rare cases where obfuscating Python code (and engaging in a potentially unending arms race with some opponent) is worth doing, but 99% of the time when people ask for this, they don't even have any idea who the attacker might be, and there probably won't be one, and the way they're trying to protect things wouldn't help anyway. — abarnert, Apr 11 '18 at 19:13

score 3 · Answer 1 · edited Sep 13 '18 at 00:18

Egg files are just renamed .zip files.

You can use the zipfile library to open the egg and extract or read the file you need.

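import zipfile

# Open the egg as a zip archive; the context manager closes it afterwards.
with zipfile.ZipFile('/path/to/file.egg', 'r') as egg:
    # Open the file from within the egg. Member names are relative to the
    # archive root, e.g. 'Wordproject/Repository/DataBank/synonyms.csv'.
    with egg.open('Wordproject/Repository/DataBank/synonyms.csv') as f:
        txt = f.read()  # bytes; call .decode() to get text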
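
As a minimal sketch of this approach, the following builds a throwaway archive standing in for the egg and parses a CSV member straight out of it. The archive name, the member path and the sample rows are assumptions made for illustration; the point is that member names are full paths relative to the archive root and always use forward slashes.

```python
import csv
import io
import os
import tempfile
import zipfile

# Hypothetical member path; inside an archive, names are relative to the
# archive root and always use forward slashes.
MEMBER = "Wordproject/Repository/DataBank/synonyms.csv"

with tempfile.TemporaryDirectory() as tmp:
    # Build a small stand-in for the egg so the example is self-contained.
    egg_path = os.path.join(tmp, "Wordproject-1.0-py3.6.egg")
    with zipfile.ZipFile(egg_path, "w") as egg:
        egg.writestr(MEMBER, "word,synonym\nbig,large\nsmall,tiny\n")

    # Read the member directly from the archive; nothing is extracted to disk.
    with zipfile.ZipFile(egg_path) as egg:
        with egg.open(MEMBER) as raw:
            # ZipFile.open returns a binary stream, so wrap it for text parsing.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            rows = list(csv.DictReader(text))

print(rows)
```

The binary stream returned by `egg.open` is an ordinary file-like object, so other parsers that accept file-like objects should be able to consume it in the same way.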

edited Sep 13 '18 at 00:18 · ViFI

answered Apr 11 '18 at 18:29 · Brendan Abel

so you mean I unzip the file egg at runtime and then read through it? Where do I unzip it? – iam.Carrot Apr 11 '18 at 18:30
@iam.Carrot Updated my answer to show how to read files directly out of the zip archive, no need to extract the data to disk. – Brendan Abel Apr 11 '18 at 18:32
I am using `pandas` to read the `csv`, is there a way in pandas through which I can read the file? – iam.Carrot Apr 11 '18 at 18:34
@iam.Carrot The object returned from `zip.open` is a file-like object that you should be able to feed directly to `pandas.read_csv` – Brendan Abel Apr 11 '18 at 18:36
I would also suggest using a context manager! `with zipfile.ZipFile('/path/to/file.egg/', 'r') as zip:` – Josie Thompson Sep 13 '18 at 00:18

abarnert · Accepted Answer · 2018-04-11T19:09:49.727

1

There are two different things you could be trying to do here:

1. Treat the data files as part of your package, like the Python modules, and access them at runtime as if your package were a normal directory tree even if it isn't.
2. Get the data files installed somewhere else at pip install time, to a location you can access normally.

Both are explained in the section on data files in the PyPA/setuptools docs. I think you want the first one here, which is covered in the subsection on Accessing Data Files at Runtime:

Typically, existing programs manipulate a package’s __file__ attribute in order to find the location of data files. However, this manipulation isn’t compatible with PEP 302-based import hooks, including importing from zip files and Python Eggs. It is strongly recommended that, if you are using data files, you should use the ResourceManager API of pkg_resources to access them. The pkg_resources module is distributed as part of setuptools, so if you’re using setuptools to distribute your package, there is no reason not to use its resource management API. See also Accessing Package Resources for a quick example of converting code that uses __file__ to use pkg_resources instead.

Follow that link, and you find what look like some crufty old PEAK docs, but that's only because they really are crufty old PEAK docs. There is a version buried inside the setuptools docs that you may find easier to read and navigate once you manage to find it.

As it says, you could try using get_data (which will work inside an egg/zip) and then fall back to accessing a file (which will work when running from source), but you're better off using the wrappers in pkg_resources. Basically, if your code was doing this:

import os

# DataBank/ sits next to this module, so start from the module's directory
path = os.path.join(os.path.dirname(__file__), 'DataBank', datathingy)
with open(path) as f:
    for line in f:
        do_stuff(line)

… you'll change it to this:

import pkg_resources

# Resource names are relative to the package containing this module
# and always use '/' as the separator
path = 'DataBank/' + datathingy
f = pkg_resources.resource_stream(__name__, path)
for line in f:
    do_stuff(line.decode())

Notice that resource_stream files are always opened in binary mode. So if you want to read them as text, you need to wrap a TextIOWrapper around them, or decode each line.
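
A minimal sketch of the binary-versus-text point, using the standard library's `pkgutil.get_data` (a wrapper around the loader's `get_data`) against a zipped package, so that it runs without setuptools. The package name `demo_wordpkg` and its contents are made up for the example; a binary stream from `pkg_resources.resource_stream` could be wrapped in a `TextIOWrapper` in the same way.

```python
import io
import os
import pkgutil
import sys
import tempfile
import zipfile

# Build a zipped package (hypothetical name) that carries a data file.
# The temporary directory is left in place for simplicity.
tmp = tempfile.mkdtemp()
archive = os.path.join(tmp, "demo_wordpkg.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("demo_wordpkg/__init__.py", "")
    zf.writestr("demo_wordpkg/DataBank/words.txt", "alpha\nbeta\n")

# Putting the archive on sys.path makes the package importable from the zip.
sys.path.insert(0, archive)

# get_data returns bytes; the resource name is relative to the package
# and uses forward slashes.
data = pkgutil.get_data("demo_wordpkg", "DataBank/words.txt")

# Wrap the bytes in a text layer so that iteration yields str lines.
with io.TextIOWrapper(io.BytesIO(data), encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

print(lines)
```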

edited Apr 11 '18 at 19:09

answered Apr 11 '18 at 19:03 · abarnert

TBH My first intuition was using the resource manager API itself. But I couldn't get it to work. When I passed in the file name with a folder structure, it threw me an error and hence I opted for the question here. It would be of great help if you could showcase a sample code for this where the file egg is `WordProject` while it has a subdirectory `Repository` and inside that directory I have another directory `DataBank` and I am reading files from there. – iam.Carrot Apr 11 '18 at 19:10
@iam.Carrot I can't build a sample that matches your layout, because I don't know your layout. But you can give us an [mcve] that shows exactly what you tried, and exactly what error you got, and then we can help debug that. – abarnert Apr 11 '18 at 19:11
yeah I kinda saw that one coming. I'll see if I can work on something for it – iam.Carrot Apr 11 '18 at 19:12
I have one last question, I was using `pandas` to read the `csv` file as a Dataframe, is there a way I can achieve this using the resource manager API? – iam.Carrot Apr 11 '18 at 19:26
@iam.Carrot I haven’t tried it, but I think Pandas can use a resource stream the same way it uses an actual open file. If not… I assume the file is too big to read the whole thing into memory and then tell Pandas to parse it as a string or you would have just done that, so you might have to create a tempfile, copy the stream to the tempfile, then have Pandas open that, but that’s a worst-comes-to-worst fallback. – abarnert Apr 11 '18 at 20:08
@iam.Carrot - By any chance did you get a proper solution? – Manish Mar 03 '21 at 07:22
@Manish I actually used the above solution: I used `pkg_resources` to fetch the base path of the file and then defined a `relative path` to the file I was trying to load, and it worked. If you'd like I can share a sample. – iam.Carrot Mar 05 '21 at 17:29
Python 3.7 added importlib_resources, and pkg_resources documentation seems to suggest people should use that instead. – chrisinmtown Jan 03 '22 at 18:50

score 0 · Answer 3 · answered Sep 14 '18 at 15:08

Based on the documentation, we can read the contents of the file in multiple ways.

Solution 1: Read the contents of the file directly from the egg

This avoids extracting the archive to disk; the contents are copied into a temporary file instead.

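import zipfile, tempfile

tfile = tempfile.NamedTemporaryFile()
with zipfile.ZipFile('/path/to/egg.egg') as myzip:
    with myzip.open('relative/path/to/file.txt') as myfile:
        tfile.write(myfile.read())
tfile.flush()   # make sure the data is on disk for readers using tfile.name
tfile.seek(0)   # rewind so tfile itself can be read from the start

# .. do something with temporary file

tfile.close()   # closing the handle also deletes the temporary file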

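Now tfile is your local temporary file handle. Its name is tfile.name, and ordinary file operations such as open(tfile.name) work on it as usual (on Windows, reopening the file by name while the handle is still open may fail). tfile.close() must be called at the end to close the handle, which also deletes the file.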

The contents can also be read with myfile.read() directly, but the myfile handle is closed as soon as we exit the context. So the contents are copied into a temporary file if they need to be passed around for other operations.
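
If the file is small enough to hold in memory, a temporary file may not be needed at all. The sketch below, which uses a made-up archive and member name, copies the member's bytes into an `io.BytesIO` buffer that stays usable after the archive has been closed.

```python
import io
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in archive with one hypothetical member.
    egg_path = os.path.join(tmp, "example.egg")
    with zipfile.ZipFile(egg_path, "w") as egg:
        egg.writestr("relative/path/to/file.txt", "first line\nsecond line\n")

    # ZipFile.read returns the member's full contents as bytes.
    with zipfile.ZipFile(egg_path) as egg:
        buffer = io.BytesIO(egg.read("relative/path/to/file.txt"))

# The buffer is an ordinary binary file-like object, independent of the
# archive (and of the temporary directory, which is gone by now).
first = buffer.readline().decode("utf-8")
print(first)
```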

Solution 2: Extract a member of the egg locally

zipfile provides an API for extracting a specific member:

import zipfile
x = zipfile.ZipFile('/path/to/egg.egg')
x.extractall(path='temp/dest/folder', members=['path/to/file.txt'])
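
For a single file, `ZipFile.extract` does the same job and returns the path it wrote. A small sketch, with an invented archive and destination folder, showing that only the requested member ends up on disk:

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in archive with two hypothetical members.
    egg_path = os.path.join(tmp, "example.egg")
    with zipfile.ZipFile(egg_path, "w") as egg:
        egg.writestr("path/to/file.txt", "hello\n")
        egg.writestr("path/to/other.txt", "ignored\n")

    dest = os.path.join(tmp, "dest")
    with zipfile.ZipFile(egg_path) as egg:
        # extract() writes one member and returns the path it created.
        extracted = egg.extract("path/to/file.txt", path=dest)

    with open(extracted, encoding="utf-8") as f:
        content = f.read()
    # Only the requested member was written under the destination folder.
    extracted_names = sorted(os.listdir(os.path.join(dest, "path", "to")))

print(content, extracted_names)
```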

Solution 3: Extract the whole egg

Another solution is to extract the egg into a temporary folder and then read the file. The egg can be extracted on the command line as follows:

python -m zipfile -e path/to/my.egg ./temp_destination

score 0 · Answer 4 · answered Jan 03 '22 at 18:53

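If you're using Python 3.7 or later, I suggest using importlib_resources (the backport of the standard library's importlib.resources; the files() API shown here is in the standard library itself from Python 3.9). From their doc https://importlib-resources.readthedocs.io/en/latest/using.html here's an example of getting a YAML file tucked into a module: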

from importlib_resources import files, as_file

yaml_path = files('my-module').joinpath('openapi.yml')
with as_file(yaml_path) as yaml:
    conn_app.add_api(yaml)

This works if the module is installed in a directory via pip3 install . and also if installed as an egg (zip) file via python3 setup.py install
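
A self-contained sketch of the same idea using the standard library's `importlib.resources`, assuming Python 3.9 or later. The package `demo_respkg` and its data file are invented for the example, and the package is imported from a zip archive to mimic an egg.

```python
import os
import sys
import tempfile
import zipfile
from importlib.resources import as_file, files

# Build a zipped package (hypothetical name) that carries a data file.
tmp = tempfile.mkdtemp()
archive = os.path.join(tmp, "demo_respkg.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("demo_respkg/__init__.py", "")
    zf.writestr("demo_respkg/DataBank/acronyms.csv",
                "acronym,meaning\nAPI,interface\n")
sys.path.insert(0, archive)

# files() returns a traversable object; "/" walks into subdirectories.
resource = files("demo_respkg") / "DataBank" / "acronyms.csv"

# Read the resource directly, without touching the filesystem.
text = resource.read_text(encoding="utf-8")

# as_file() gives a real filesystem path for APIs that insist on one;
# for a zipped resource it is a temporary copy removed on exit.
with as_file(resource) as real_path:
    on_disk = real_path.read_text(encoding="utf-8")

print(text)
```

Reading through the traversable object is usually enough; `as_file` is mainly useful when a library only accepts a path on disk.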
