I am attempting to include the dateparser package for a PySpark (v2.4.3) shell session by a short little zip build process pip install -r requirements.txt -t some_target && cd some_target && zip -r ../deps.zip . && cd ..
, after which I would, for example, pyspark --py-files deps.zip
. When import
ing dateparser
, though, I get an indirect ModuleNotFoundError from the regex library, whining that "No module named 'regex._regex'" (stack trace says this is referenced in /mnt/tmp/spark-some/long/path/deps.zip/regex/_regex_core.py line 21, which is of course referenced much farther up the stack by dateparser).
I attempted adding a flag to the dateparser line in requirements.txt like dateparser --no-binary=regex
, but the error persisted. A normal python shell is able to import without issue, and other packages in this zip seem to be importable in PySpark shell without issue. This has led me down a number of rabbit holes, but I think/hope I have finally found the culprit: namely, that regex._regex is not a normal .py
file, but rather a .so
. My knowledge of python build process is limited, but it seems that regex library's setup.py uses the setuptools.Extension
class to compile some C files into this shared object. I have seen suggestions to modify LD_LIBRARY_PATH environment variable in order to make those shared objects discoverable to python, but a number of comments also suggested this was dangerous and not a viable long-term solution. The fact that a normal python interactive session has no issue with the import also has me skeptical, since the LD_LIBRARY_PATH variable doesn't even exist in os.environ
within that interactive shell. I'm thence left wondering if --py-files
is insufficient for including packages that compile these Extension objects (seems unlikely, since there are a lot of people doing crazier things than my simple use case), or if this actually stems from neglecting some other setting.
Merci mille fois for any and all help :)