When I try to run the (simplified/illustrative) Spark/Python script shown below in the Mac Terminal (Bash), errors occur whenever imports are used for numpy, pandas, or pyspark.ml. The sample Python code runs well with the 'Section 1' imports listed below (provided they include from pyspark.sql import SparkSession), but fails when any of the 'Section 2' imports are used. The full error message is shown below; part of it reads: '..._multiarray_umath.cpython-310-darwin.so' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64'). Apparently, there was a problem importing the NumPy C-extensions on some of the computing nodes. Is there a way to resolve the error so that a variety of pyspark.ml and other imports will function normally? [Spoiler alert: it turns out there is! See the solution below!]
I believe the problem could stem from one or more of the following causes: (1) improper setting of the environment variables (e.g., PATH), (2) an incorrect SparkSession setting in the code, (3) an omitted but necessary Python module import, (4) improper integration of the related downloads (in this case, Spark 3.2.1 (spark-3.2.1-bin-hadoop2.7), Scala (2.12.15), Java (1.8.0_321), sbt (1.6.2), Python 3.10.1, and NumPy 1.22.2) in the local development environment (a 2021 MacBook Pro (Apple M1 Max) running macOS Monterey version 12.2.1), or (5) perhaps a hardware/software incompatibility. (A quick Terminal check related to causes (1) and (4) is sketched just below.)
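Specifically, something like the following commands (my own idea, not taken from any documentation) should show which interpreter the shell resolves, what PYSPARK_PYTHON is set to, and what architecture that interpreter reports:
# Which python3 does the shell resolve, and what is PYSPARK_PYTHON?
which python3
echo "PYSPARK_PYTHON=$PYSPARK_PYTHON"
# What architecture does that interpreter report (arm64 vs. x86_64)?
python3 -c "import sys, platform; print(sys.executable, platform.machine())"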
Please note that the existing combination of code (in more complex forms), software, and hardware runs fine to import and process data, display Spark dataframes, etc., using Terminal, as long as the imports are restricted to basic versions of pyspark.sql. Other imports seem to cause problems, and probably shouldn't.
The sample code (a simple but working program only intended to illustrate the problem):
# Example code to illustrate an issue when using locally-installed versions
# of Spark 3.2.1 (spark-3.2.1-bin-hadoop2.7), Scala (2.12.15),
# Java (1.8.0_321), sbt (1.6.2), Python 3.10.1, and NumPy 1.22.2 on a
# MacBook Pro (Apple M1 Max) running macOS Monterey version 12.2.1
# The Python code is run using 'spark-submit test.py' in Terminal
# Section 1.
# Imports that cause no errors (only the first is required):
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *
# Section 2.
# Example imports that individually cause similar errors when used:
# import numpy as np
# import pandas as pd
# from pyspark.ml.feature import StringIndexer
# from pyspark.ml.feature import VectorAssembler
# from pyspark.ml.classification import RandomForestClassifier
# from pyspark.ml import *
spark = (SparkSession
         .builder
         .appName("test.py")
         .enableHiveSupport()
         .getOrCreate())
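# Optional diagnostic (not part of the original example): print which
# interpreter and architecture this spark-submit driver is actually using.
# sys and platform are standard-library modules, so they import cleanly
# even when the NumPy C-extensions fail.
import sys, platform
print("Driver interpreter:", sys.executable, "| architecture:", platform.machine())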
# The associated dataset is located here (but is not required to replicate the issue):
# https://github.com/databricks/LearningSparkV2/blob/master/databricks-datasets/learning-spark-v2/flights/departuredelays.csv
# Create database and managed tables
spark.sql("DROP DATABASE IF EXISTS learn_spark_db CASCADE")
spark.sql("CREATE DATABASE learn_spark_db")
spark.sql("USE learn_spark_db")
spark.sql("CREATE TABLE us_delay_flights_tbl(date STRING, delay INT, distance INT, origin STRING, destination STRING)")
# Display (print) the database
print(spark.catalog.listDatabases())
print('Completed with no errors!')
Here is the error-free output that results when only Section 1 imports are used (some details have been replaced by '...'):
MacBook-Pro ~/.../Spark2/spark-3.2.1-bin-hadoop2.7/LearningSparkGitHub/chapter4/py/src$ spark-submit test.py
[Database(name='default', description='Default Hive database', locationUri='file:/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/LearningSparkGitHub/chapter4/py/src/spark-warehouse'), Database(name='learn_spark_db', description='', locationUri='file:/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/LearningSparkGitHub/chapter4/py/src/spark-warehouse/learn_spark_db.db')]
Completed with no errors!
Here is the error that typically results when using from pyspark.ml import * or other (Section 2) imports individually:
MacBook-Pro ~/.../Spark2/spark-3.2.1-bin-hadoop2.7/LearningSparkGitHub/chapter4/py/src$ spark-submit test.py
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/__init__.py", line 23, in <module>
from . import multiarray
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/multiarray.py", line 10, in <module>
from . import overrides
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/overrides.py", line 6, in <module>
from numpy.core._multiarray_umath import (
ImportError: dlopen(/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-darwin.so, 0x0002): tried: '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-darwin.so' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64')), '/usr/lib/_multiarray_umath.cpython-310-darwin.so' (no such file)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/LearningSparkGitHub/chapter4/py/src/test.py", line 28, in <module>
from pyspark.ml import *
File "/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/__init__.py", line 22, in <module>
File "/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/base.py", line 25, in <module>
File "/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/param/__init__.py", line 21, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/__init__.py", line 144, in <module>
from . import core
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/__init__.py", line 49, in <module>
raise ImportError(msg)
ImportError:
IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!
Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.
We have compiled some common reasons and troubleshooting tips at:
https://numpy.org/devdocs/user/troubleshooting-importerror.html
Please note and check the following:
* The Python version is: Python3.10 from "/Library/Frameworks/Python.framework/Versions/3.10/bin/python3"
* The NumPy version is: "1.22.2"
and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.
Original error was: dlopen(/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-darwin.so, 0x0002): tried: '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-darwin.so' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64')), '/usr/lib/_multiarray_umath.cpython-310-darwin.so' (no such file)
To respond to the comment mentioned in the error message: Yes, the Python and NumPy versions noted above appear to be correct. (But it turns out the reference to Python 3.10 was misleading, as it was probably a reference to Python 3.10.1 rather than Python 3.10.2, as mentioned in Edit 1, below.)
For your reference, here are the settings currently used in the ~/.bash_profile:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_321.jdk/Contents/Home/
export SPARK_HOME=/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7
export SBT_HOME=/Users/.../Spark2/sbt
export SCALA_HOME=/Users/.../Spark2/scala-2.12.15
export PATH=$JAVA_HOME/bin:$SBT_HOME/bin:$SBT_HOME/lib:$SCALA_HOME/bin:$SCALA_HOME/lib:$PATH
export PATH=$JAVA_HOME/bin:$SPARK_HOME:$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
export PYSPARK_PYTHON=python3
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
# export PYSPARK_DRIVER_PYTHON="jupyter"
# export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
PATH="/Library/Frameworks/Python.framework/Versions/3.10/bin:${PATH}"
export PATH
# Misc: cursor customization, MySQL
export PS1="\h \w$ "
export PATH=${PATH}:/usr/local/mysql/bin/
# Not used, but available:
# export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk-16.0.1.jdk/Contents/Home
# export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home
# export PATH=$PATH:$SPARK_HOME/bin
# For use of SDKMAN!
export SDKMAN_DIR="$HOME/.sdkman"
[[ -s "$HOME/.sdkman/bin/sdkman-init.sh" ]] && source "$HOME/.sdkman/bin/sdkman-init.sh"
The following website was helpful for loading and integrating Spark, Scala, Java, sbt, and Python (versions noted above): https://kevinvecmanis.io/python/pyspark/install/2019/05/31/Installing-Apache-Spark.html. Please note that the jupyter and notebook driver settings have been commented out in the Bash profile because they are probably unnecessary (and because, at one point, they seemed to interfere with the use of spark-submit commands in Terminal).
A review of the referenced numpy.org website did not help much: https://numpy.org/devdocs/user/troubleshooting-importerror.html
In response to some of the comments on the numpy.org website: a Python3 shell runs fine in the Mac Terminal, and pyspark and other imports (numpy, etc.) work there normally. Here is the output that results when printing the PYTHONPATH and PATH variables from Python interactively (with a few details replaced by '...'):
>>> import os
>>> print("PYTHONPATH:", os.environ.get('PYTHONPATH'))
PYTHONPATH: /Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/python/:
>>> print("PATH:", os.environ.get('PATH'))
PATH: /Users/.../.sdkman/candidates/sbt/current/bin:/Library/Frameworks/Python.framework/Versions/3.10/bin:/Library/Java/JavaVirtualMachines/jdk1.8.0_321.jdk/Contents/Home//bin:/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7:/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/bin:/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/sbin:/Library/Java/JavaVirtualMachines/jdk1.8.0_321.jdk/Contents/Home//bin:/Users/.../Spark2/sbt/bin:/Users/.../Spark2/sbt/lib:/Users/.../Spark2/scala-2.12.15/bin:/Users/.../Spark2/scala-2.12.15/lib:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/MacGPG2/bin:/Library/Apple/usr/bin:/usr/local/mysql/bin/
(I am not sure which portion of this output points to a problem.)
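One more check I could run from the same Terminal (again, my own idea) is to print where this interpreter loads NumPy from, to confirm it is the same installation whose _multiarray_umath file appears in the error message:
python3 -c "import numpy; print(numpy.__file__)"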
The following remedies were previously attempted (all unsuccessful):
- The use and testing of a variety of environment variables in the ~/.bash_profile
- Uninstallation and reinstallation of Python and NumPy using pip3
- Re-installation of Spark, Scala, Java, Python, and sbt in a (new) local dev environment
- Many Internet searches on the error message, etc.
To date, no action has resolved the problem.
Edit 1
I am adding recently discovered information.
First, it appears the environment variable mentioned above (export PYSPARK_PYTHON=python3) was pointing toward Python 3.10.1 located in /Library/Frameworks/Python.framework/Versions/3.10/bin/python3 rather than to Python 3.10.2 in my development environment. I subsequently uninstalled Python 3.10.1 and reinstalled Python 3.10.2 (python-3.10.2-macos11.pkg) on my Mac (macOS Monterey 12.2.1), but have not yet changed the PYSPARK_PYTHON path to point toward the dev environment (suggestions would be welcome on how to do that; one possibility I am considering is sketched just below). The code still throws errors as described previously.
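If it helps, here is the kind of change to the ~/.bash_profile I have been considering (untested, and the interpreter path below is only my assumption about which Python should be used); the idea is to make both the driver and the workers use one explicit interpreter rather than whatever python3 happens to resolve to:
# Untested sketch: point PySpark at one explicit interpreter
# (replace the path with the intended Python 3.10.2 installation)
export PYSPARK_PYTHON=/Library/Frameworks/Python.framework/Versions/3.10/bin/python3
export PYSPARK_DRIVER_PYTHON=/Library/Frameworks/Python.framework/Versions/3.10/bin/python3
# Reload the profile and re-run the script
source ~/.bash_profile
spark-submit test.py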
Second, it may help to know a little more about the architecture of the computer, since the error message pointed to a potential hardware/software incompatibility:
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-darwin.so' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64')
The computer is a "MacBookPro18,2" with an Apple M1 Max chip (10 cores: 8 performance, and 2 efficiency; 32-core GPU). Some websites like these (https://en.wikipedia.org/wiki/Apple_silicon#Apple_M1_Pro_and_M1_Max, https://github.com/conda-forge/miniforge/blob/main/README.md) suggest 'Apple silicon' like the M1 Max needs software designed for the 'arm64' architecture. Using Terminal on the Mac, I checked the compatibility of Python 3.10.2 and the troublesome _multiarray_umath.cpython-310-darwin.so
file. Python 3.10.2 is a 'universal binary' with 2 architectures (x86_64 and arm64), and the file is exclusively arm64:
MacBook-Pro ~$ python3 --version
Python 3.10.2
MacBook-Pro ~$ whereis python3
/usr/bin/python3
MacBook-Pro ~$ which python3
/Library/Frameworks/Python.framework/Versions/3.10/bin/python3
MacBook-Pro ~$ file /Library/Frameworks/Python.framework/Versions/3.10/bin/python3
/Library/Frameworks/Python.framework/Versions/3.10/bin/python3: Mach-O universal binary with 2 architectures: [x86_64:Mach-O 64-bit executable x86_64] [arm64:Mach-O 64-bit executable arm64]
/Library/Frameworks/Python.framework/Versions/3.10/bin/python3 (for architecture x86_64): Mach-O 64-bit executable x86_64
/Library/Frameworks/Python.framework/Versions/3.10/bin/python3 (for architecture arm64): Mach-O 64-bit executable arm64
MacBook-Pro ~$ file /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-darwin.so
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-darwin.so: Mach-O 64-bit bundle arm64
So I am still puzzled by the error message, which says 'x86_64' is needed for something (hardware or software?) to run this script. Do you need a special environment to run PySpark scripts on an Apple M1 Max chip? (A small architecture experiment I have not yet tried is sketched after the interactive session below.) As discussed previously, PySpark seems to work fine on the same computer in Python's interactive mode:
MacBook-Pro ~$ python3
Python 3.10.2 (v3.10.2:a58ebcc701, Jan 13 2022, 14:50:16) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
>>> from pyspark.sql import SparkSession
>>> from pyspark.ml import *
>>> import numpy as np
>>>
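Regarding the architecture question above, here is the small experiment I have not yet run (my own idea; macOS's arch command launches a process under a chosen architecture, and the x86_64 invocation would presumably only succeed if an x86_64 build of NumPy were installed):
# Report the architecture the default interpreter runs under
python3 -c "import platform; print(platform.machine())"
# Try the NumPy import natively (arm64) and under Rosetta 2 (x86_64)
arch -arm64 python3 -c "import numpy; print(numpy.__version__)"
arch -x86_64 python3 -c "import numpy; print(numpy.__version__)"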
Is there a way to resolve the error so that a variety of pyspark.ml and other imports will function normally in a Python script? Perhaps the settings in the ~/.bash_profile need to be changed? Would a different version of the _multiarray_umath.cpython-310-darwin.so file solve the problem, and if so, how would I obtain it? (Use a different version of Python?) I am seeking suggestions for code, settings, and/or actions. Perhaps there is an easy fix I have overlooked.
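For completeness, here is the sort of thing I was considering trying next (untested at the time of writing; the commands are only my guess at a reasonable approach): force a clean reinstall of NumPy under the exact interpreter that spark-submit uses, so that the compiled extension matches that interpreter:
# Untested idea: reinstall NumPy under the interpreter spark-submit uses
/Library/Frameworks/Python.framework/Versions/3.10/bin/python3 -m pip uninstall numpy
/Library/Frameworks/Python.framework/Versions/3.10/bin/python3 -m pip install --force-reinstall --no-cache-dir numpy
# Then confirm which file is imported
/Library/Frameworks/Python.framework/Versions/3.10/bin/python3 -c "import numpy; print(numpy.__file__)"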