
I am attempting to include the dateparser package in a PySpark (v2.4.3) shell session via a short zip build process: pip install -r requirements.txt -t some_target && cd some_target && zip -r ../deps.zip . && cd .., after which I launch, for example, pyspark --py-files deps.zip. When I import dateparser, though, I get an indirect ModuleNotFoundError from the regex library complaining that "No module named 'regex._regex'" (the stack trace says this is referenced in /mnt/tmp/spark-some/long/path/deps.zip/regex/_regex_core.py line 21, which is of course referenced much farther up the stack by dateparser).
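For reference, the full build-and-launch sequence looks roughly like this (some_target and deps.zip are just the placeholder names from my setup):

    # Install all requirements into a staging directory
    pip install -r requirements.txt -t some_target

    # Zip the installed packages at the archive's top level so imports resolve
    cd some_target
    zip -r ../deps.zip .
    cd ..

    # Ship the archive to the driver and executors
    pyspark --py-files deps.zip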

I tried adding a flag to the dateparser line in requirements.txt, like dateparser --no-binary=regex, but the error persisted. A normal Python shell can import dateparser without issue, and other packages in this zip seem to be importable in the PySpark shell without issue.

This has led me down a number of rabbit holes, but I think/hope I have finally found the culprit: namely, that regex._regex is not a normal .py file but a compiled .so. My knowledge of the Python build process is limited, but it seems that the regex library's setup.py uses the setuptools.Extension class to compile some C files into this shared object. I have seen suggestions to modify the LD_LIBRARY_PATH environment variable to make those shared objects discoverable to Python, but a number of comments also suggested this was dangerous and not a viable long-term solution. The fact that a normal interactive Python session has no issue with the import also leaves me skeptical, since LD_LIBRARY_PATH doesn't even exist in os.environ within that shell. I'm therefore left wondering whether --py-files is insufficient for including packages that compile these Extension objects (which seems unlikely, since plenty of people are doing crazier things than my simple use case), or whether this actually stems from neglecting some other setting.
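For what it's worth, a quick way to confirm the compiled module is what's sitting inside the archive (deps.zip is the archive from my build step above):

    # List the regex package contents inside the archive; _regex shows up as a
    # compiled extension (e.g. _regex.cpython-36m-x86_64-linux-gnu.so), not a
    # .py file -- the exact filename depends on your Python version/platform
    unzip -l deps.zip | grep 'regex/_regex'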

A thousand thanks for any and all help :)

1 Answer


The error appears to stem from the Python import machinery not being able to load binary (.so) files from within a zip archive, i.e., the deps.zip that I pass with the --py-files parameter. I first tried pulling the regex dependency out and building a .whl to include in --py-files, only to discover that my version of PySpark (v2.4.3) predates wheel support. I was, however, able to build an .egg from the source code and then set the PYTHON_EGG_CACHE and PYTHON_EGG_DIR environment variables via spark.executorEnv and spark.driverEnv. I'm not sure the last step would be necessary for others; it seems to have stemmed from weird permissions issues that may just apply to my user/group/use case.
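Roughly, the steps looked like this (the paths, the cache directory, and the exact conf keys below are assumptions based on my environment, so adjust to taste):

    # Grab the regex source distribution and build an egg from it
    pip download regex --no-binary :all: -d regex_src
    cd regex_src
    tar xzf regex-*.tar.gz && cd regex-*/
    python setup.py bdist_egg
    cd ../..

    # Launch with the egg alongside the rest of the zipped deps.
    # PYTHON_EGG_CACHE points pkg_resources at a writable extraction dir;
    # exporting it covers the driver, spark.executorEnv covers executors.
    EGG=$(ls regex_src/regex-*/dist/*.egg)
    export PYTHON_EGG_CACHE=/tmp/python-eggs
    pyspark \
      --py-files "deps.zip,$EGG" \
      --conf spark.executorEnv.PYTHON_EGG_CACHE=/tmp/python-eggs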