When I try to run the (simplified/illustrative) Spark/Python script shown below in the Mac Terminal (Bash), errors occur whenever imports are used for numpy, pandas, or pyspark.ml. The sample Python code runs well with the 'Section 1' imports listed below (provided they include from pyspark.sql import SparkSession), but fails when any of the 'Section 2' imports are used. The full error message is shown below; part of it reads: '..._multiarray_umath.cpython-310-darwin.so' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64'). Apparently, there was a problem importing the NumPy C-extensions on some of the computing nodes. Is there a way to resolve the error so that a variety of pyspark.ml and other imports will function normally? [Spoiler alert: it turns out there is! See the solution below!]
I believe the problem could stem from one or more of the following causes: (1) improper setting of the environment variables (e.g., PATH), (2) an incorrect SparkSession setting in the code, (3) an omitted but necessary Python module import, (4) improper integration of the related downloads (in this case, Spark 3.2.1 (spark-3.2.1-bin-hadoop2.7), Scala (2.12.15), Java (1.8.0_321), sbt (1.6.2), Python 3.10.1, and NumPy 1.22.2) in the local development environment (a 2021 MacBook Pro (Apple M1 Max) running macOS Monterey version 12.2.1), or (5) perhaps a hardware/software incompatibility. (A quick Terminal check related to causes (1) and (4) is sketched just below.)
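Specifically, something like the following commands (my own idea, not taken from any documentation) should show which interpreter the shell resolves, what PYSPARK_PYTHON is set to, and what architecture that interpreter reports:
# Which python3 does the shell resolve, and what is PYSPARK_PYTHON?
which python3
echo "PYSPARK_PYTHON=$PYSPARK_PYTHON"
# What architecture does that interpreter report (arm64 vs. x86_64)?
python3 -c "import sys, platform; print(sys.executable, platform.machine())"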
Please note that the existing combination of code (in more complex forms), software, and hardware runs fine to import and process data, display Spark dataframes, etc., using Terminal, as long as the imports are restricted to basic versions of pyspark.sql. Other imports seem to cause problems, and probably shouldn't.
The sample code (a simple but working program only intended to illustrate the problem):
# Example code to illustrate an issue when using locally-installed versions
# of Spark 3.2.1 (spark-3.2.1-bin-hadoop2.7), Scala (2.12.15),
# Java (1.8.0_321), sbt (1.6.2), Python 3.10.1, and NumPy 1.22.2 on a
# MacBook Pro (Apple M1 Max) running macOS Monterey version 12.2.1
# The Python code is run using 'spark-submit test.py' in Terminal
# Section 1.
# Imports that cause no errors (only the first is required):
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *
# Section 2.
# Example imports that individually cause similar errors when used:
# import numpy as np
# import pandas as pd
# from pyspark.ml.feature import StringIndexer
# from pyspark.ml.feature import VectorAssembler
# from pyspark.ml.classification import RandomForestClassifier
# from pyspark.ml import *
spark = (SparkSession
         .builder
         .appName("test.py")
         .enableHiveSupport()
         .getOrCreate())
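# Optional diagnostic (not part of the original example): print which
# interpreter and architecture this spark-submit driver is actually using.
# sys and platform are standard-library modules, so they import cleanly
# even when the NumPy C-extensions fail.
import sys, platform
print("Driver interpreter:", sys.executable, "| architecture:", platform.machine())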
# The associated dataset is located here (but is not required to replicate the issue):
# https://github.com/databricks/LearningSparkV2/blob/master/databricks-datasets/learning-spark-v2/flights/departuredelays.csv
# Create database and managed tables
spark.sql("DROP DATABASE IF EXISTS learn_spark_db CASCADE")
spark.sql("CREATE DATABASE learn_spark_db")
spark.sql("USE learn_spark_db")
spark.sql("CREATE TABLE us_delay_flights_tbl(date STRING, delay INT, distance INT, origin STRING, destination STRING)")
# Display (print) the database
print(spark.catalog.listDatabases())
print('Completed with no errors!')
Here is the error-free output that results when only Section 1 imports are used (some details have been replaced by '...'):
MacBook-Pro ~/.../Spark2/spark-3.2.1-bin-hadoop2.7/LearningSparkGitHub/chapter4/py/src$ spark-submit test.py
[Database(name='default', description='Default Hive database', locationUri='file:/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/LearningSparkGitHub/chapter4/py/src/spark-warehouse'), Database(name='learn_spark_db', description='', locationUri='file:/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/LearningSparkGitHub/chapter4/py/src/spark-warehouse/learn_spark_db.db')]
Completed with no errors!
Here is the error that typically results when using from pyspark.ml import * or other (Section 2) imports individually:
MacBook-Pro ~/.../Spark2/spark-3.2.1-bin-hadoop2.7/LearningSparkGitHub/chapter4/py/src$ spark-submit test.py
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/__init__.py", line 23, in <module>
from . import multiarray
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/multiarray.py", line 10, in <module>
from . import overrides
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/overrides.py", line 6, in <module>
from numpy.core._multiarray_umath import (
ImportError: dlopen(/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-darwin.so, 0x0002): tried: '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-darwin.so' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64')), '/usr/lib/_multiarray_umath.cpython-310-darwin.so' (no such file)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/LearningSparkGitHub/chapter4/py/src/test.py", line 28, in <module>
from pyspark.ml import *
File "/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/__init__.py", line 22, in <module>
File "/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/base.py", line 25, in <module>
File "/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/param/__init__.py", line 21, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/__init__.py", line 144, in <module>
from . import core
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/__init__.py", line 49, in <module>
raise ImportError(msg)
ImportError:
IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!
Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.
We have compiled some common reasons and troubleshooting tips at:
https://numpy.org/devdocs/user/troubleshooting-importerror.html
Please note and check the following:
* The Python version is: Python3.10 from "/Library/Frameworks/Python.framework/Versions/3.10/bin/python3"
* The NumPy version is: "1.22.2"
and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.
Original error was: dlopen(/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-darwin.so, 0x0002): tried: '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-darwin.so' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64')), '/usr/lib/_multiarray_umath.cpython-310-darwin.so' (no such file)
To respond to the comment mentioned in the error message: Yes, the Python and NumPy versions noted above appear to be correct. (But it turns out the reference to Python 3.10 was misleading, as it was probably a reference to Python 3.10.1 rather than Python 3.10.2, as mentioned in Edit 1, below.)
For your reference, here are the settings currently used in the ~/.bash_profile:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_321.jdk/Contents/Home/
export SPARK_HOME=/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7
export SBT_HOME=/Users/.../Spark2/sbt
export SCALA_HOME=/Users/.../Spark2/scala-2.12.15
export PATH=$JAVA_HOME/bin:$SBT_HOME/bin:$SBT_HOME/lib:$SCALA_HOME/bin:$SCALA_HOME/lib:$PATH
export PATH=$JAVA_HOME/bin:$SPARK_HOME:$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
export PYSPARK_PYTHON=python3
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
# export PYSPARK_DRIVER_PYTHON="jupyter"
# export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
PATH="/Library/Frameworks/Python.framework/Versions/3.10/bin:${PATH}"
export PATH
# Misc: cursor customization, MySQL
export PS1="\h \w$ "
export PATH=${PATH}:/usr/local/mysql/bin/
# Not used, but available:
# export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk-16.0.1.jdk/Contents/Home
# export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home
# export PATH=$PATH:$SPARK_HOME/bin
# For use of SDKMAN!
export SDKMAN_DIR="$HOME/.sdkman"
[[ -s "$HOME/.sdkman/bin/sdkman-init.sh" ]] && source "$HOME/.sdkman/bin/sdkman-init.sh"
The following website was helpful for loading and integrating Spark, Scala, Java, sbt, and Python (versions noted above): https://kevinvecmanis.io/python/pyspark/install/2019/05/31/Installing-Apache-Spark.html. Please note that the jupyter and notebook driver settings have been commented out in the Bash profile because they are probably unnecessary (and because, at one point, they seemed to interfere with the use of spark-submit commands in Terminal).
A review of the referenced numpy.org website did not help much: https://numpy.org/devdocs/user/troubleshooting-importerror.html
In response to some of the comments on the numpy.org website: a Python3 shell runs fine in the Mac Terminal, and pyspark and other imports (numpy, etc.) work there normally. Here is the output that results when printing the PYTHONPATH and PATH variables from Python interactively (with a few details replaced by '...'):
>>> import os
>>> print("PYTHONPATH:", os.environ.get('PYTHONPATH'))
PYTHONPATH: /Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/python/:
>>> print("PATH:", os.environ.get('PATH'))
PATH: /Users/.../.sdkman/candidates/sbt/current/bin:/Library/Frameworks/Python.framework/Versions/3.10/bin:/Library/Java/JavaVirtualMachines/jdk1.8.0_321.jdk/Contents/Home//bin:/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7:/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/bin:/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/sbin:/Library/Java/JavaVirtualMachines/jdk1.8.0_321.jdk/Contents/Home//bin:/Users/.../Spark2/sbt/bin:/Users/.../Spark2/sbt/lib:/Users/.../Spark2/scala-2.12.15/bin:/Users/.../Spark2/scala-2.12.15/lib:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/MacGPG2/bin:/Library/Apple/usr/bin:/usr/local/mysql/bin/
(I am not sure which portion of this output points to a problem.)
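One more check I could run from the same Terminal (again, my own idea) is to print where this interpreter loads NumPy from, to confirm it is the same installation whose _multiarray_umath file appears in the error message:
python3 -c "import numpy; print(numpy.__file__)"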
The following remedies were previously attempted (all unsuccessful):
- The use and testing of a variety of environment variables in the ~/.bash_profile
- Uninstallation and reinstallation of Python and NumPy using pip3
- Re-installation of Spark, Scala, Java, Python, and sbt in a (new) local dev environment
- Many Internet searches on the error message, etc.
To date, no action has resolved the problem.
Edit 1
I am adding recently discovered information.
First, it appears the environment variable mentioned above (export PYSPARK_PYTHON=python3) was pointing toward Python 3.10.1 located in /Library/Frameworks/Python.framework/Versions/3.10/bin/python3 rather than to Python 3.10.2 in my development environment. I subsequently uninstalled Python 3.10.1 and reinstalled Python 3.10.2 (python-3.10.2-macos11.pkg) on my Mac (macOS Monterey 12.2.1), but have not yet changed the PYSPARK_PYTHON path to point toward the dev environment (suggestions would be welcome on how to do that; one possibility I am considering is sketched just below). The code still throws errors as described previously.
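If it helps, here is the kind of change to the ~/.bash_profile I have been considering (untested, and the interpreter path below is only my assumption about which Python should be used); the idea is to make both the driver and the workers use one explicit interpreter rather than whatever python3 happens to resolve to:
# Untested sketch: point PySpark at one explicit interpreter
# (replace the path with the intended Python 3.10.2 installation)
export PYSPARK_PYTHON=/Library/Frameworks/Python.framework/Versions/3.10/bin/python3
export PYSPARK_DRIVER_PYTHON=/Library/Frameworks/Python.framework/Versions/3.10/bin/python3
# Reload the profile and re-run the script
source ~/.bash_profile
spark-submit test.py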
Second, it may help to know a little more about the architecture of the computer, since the error message pointed to a potential hardware/software incompatibility:
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-darwin.so' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64')
The computer is a "MacBookPro18,2" with an Apple M1 Max chip (10 cores: 8 performance, and 2 efficiency; 32-core GPU). Some websites like these (https://en.wikipedia.org/wiki/Apple_silicon#Apple_M1_Pro_and_M1_Max, https://github.com/conda-forge/miniforge/blob/main/README.md) suggest 'Apple silicon' like the M1 Max needs software designed for the 'arm64' architecture. Using Terminal on the Mac, I checked the compatibility of Python 3.10.2 and the troublesome _multiarray_umath.cpython-310-darwin.so
file. Python 3.10.2 is a 'universal binary' with 2 architectures (x86_64 and arm64), and the file is exclusively arm64:
MacBook-Pro ~$ python3 --version
Python 3.10.2
MacBook-Pro ~$ whereis python3
/usr/bin/python3
MacBook-Pro ~$ which python3
/Library/Frameworks/Python.framework/Versions/3.10/bin/python3
MacBook-Pro ~$ file /Library/Frameworks/Python.framework/Versions/3.10/bin/python3
/Library/Frameworks/Python.framework/Versions/3.10/bin/python3: Mach-O universal binary with 2 architectures: [x86_64:Mach-O 64-bit executable x86_64] [arm64:Mach-O 64-bit executable arm64]
/Library/Frameworks/Python.framework/Versions/3.10/bin/python3 (for architecture x86_64): Mach-O 64-bit executable x86_64
/Library/Frameworks/Python.framework/Versions/3.10/bin/python3 (for architecture arm64): Mach-O 64-bit executable arm64
MacBook-Pro ~$ file /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-darwin.so
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-darwin.so: Mach-O 64-bit bundle arm64
So I am still puzzled by the error message, which says 'x86_64' is needed for something (hardware or software?) to run this script. Do you need a special environment to run PySpark scripts on an Apple M1 Max chip? (A small architecture experiment I have not yet tried is sketched after the interactive session below.) As discussed previously, PySpark seems to work fine on the same computer in Python's interactive mode:
MacBook-Pro ~$ python3
Python 3.10.2 (v3.10.2:a58ebcc701, Jan 13 2022, 14:50:16) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
>>> from pyspark.sql import SparkSession
>>> from pyspark.ml import *
>>> import numpy as np
>>>
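Regarding the architecture question above, here is the small experiment I have not yet run (my own idea; macOS's arch command launches a process under a chosen architecture, and the x86_64 invocation would presumably only succeed if an x86_64 build of NumPy were installed):
# Report the architecture the default interpreter runs under
python3 -c "import platform; print(platform.machine())"
# Try the NumPy import natively (arm64) and under Rosetta 2 (x86_64)
arch -arm64 python3 -c "import numpy; print(numpy.__version__)"
arch -x86_64 python3 -c "import numpy; print(numpy.__version__)"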
Is there a way to resolve the error so that a variety of pyspark.ml and other imports will function normally in a Python script? Perhaps the settings in the ~/.bash_profile need to be changed? Would a different version of the _multiarray_umath.cpython-310-darwin.so file solve the problem, and if so, how would I obtain it? (Use a different version of Python?) I am seeking suggestions for code, settings, and/or actions. Perhaps there is an easy fix I have overlooked.
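For completeness, here is the sort of thing I was considering trying next (untested at the time of writing; the commands are only my guess at a reasonable approach): force a clean reinstall of NumPy under the exact interpreter that spark-submit uses, so that the compiled extension matches that interpreter:
# Untested idea: reinstall NumPy under the interpreter spark-submit uses
/Library/Frameworks/Python.framework/Versions/3.10/bin/python3 -m pip uninstall numpy
/Library/Frameworks/Python.framework/Versions/3.10/bin/python3 -m pip install --force-reinstall --no-cache-dir numpy
# Then confirm which file is imported
/Library/Frameworks/Python.framework/Versions/3.10/bin/python3 -c "import numpy; print(numpy.__file__)"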