I went through this question and tried every answer except rebuilding Hadoop from source (that attempt has been failing with endless errors so far). My hope is that the prebuilt native binaries from the official Hadoop distribution will do, but I can't make them work.
Dockerfile:
FROM python:3.10
RUN apt update
RUN apt install -y default-jdk
RUN pip install pyspark==3.3.2 delta-spark==2.2.0
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ENV HADOOP_HOME /app/hadoop-3.3.4
ENV HADOOP_OPTS -Djava.library.path=/app/hadoop-3.3.4/lib/native
ENV HADOOP_PREFIX /app/hadoop-3.3.4
ENV HADOOP_COMMON_HOME /app/hadoop-3.3.4
ENV HADOOP_COMMON_LIB_NATIVE_DIR /app/hadoop-3.3.4/lib/native
ENV HADOOP_CONF_DIR /app/hadoop-3.3.4/etc/hadoop
ENV HADOOP_HDFS_HOME /app/hadoop-3.3.4
ENV LD_LIBRARY_PATH /usr/lib/hadoop/lib/native
ENV JAVA_LIBRARY_PATH /app/hadoop-3.3.4/lib/native
WORKDIR /app
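One inconsistency worth pointing out: every other variable refers to /app/hadoop-3.3.4/lib/native, while LD_LIBRARY_PATH points at /usr/lib/hadoop/lib/native. A variant that keeps them consistent (just a sketch on my part; I haven't confirmed whether this alone changes anything):

```dockerfile
# Untested variant: point LD_LIBRARY_PATH at the same native dir
# as the other HADOOP_* variables above.
ENV LD_LIBRARY_PATH /app/hadoop-3.3.4/lib/native
```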
main.py:
from pyspark.sql.session import SparkSession
from delta import configure_spark_with_delta_pip
builder = SparkSession.builder.appName("my") \
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension') \
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')
spark = configure_spark_with_delta_pip(builder) \
    .getOrCreate()
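Since pip-installed PySpark does not start the JVM through Hadoop's launcher scripts, I suspect HADOOP_OPTS is never applied, so java.library.path may stay unset in the driver. A sketch of the extra Spark confs I could pass instead (the path is the one from my Dockerfile; I have not verified that this removes the warning):

```python
# Sketch only (untested assumption): forward the native dir to the
# driver and executor JVMs via java.library.path, through Spark confs.
native_dir = "/app/hadoop-3.3.4/lib/native"
extra_confs = {
    "spark.driver.extraJavaOptions": f"-Djava.library.path={native_dir}",
    "spark.executor.extraJavaOptions": f"-Djava.library.path={native_dir}",
}

# These would be chained onto the builder above, e.g.:
#   for k, v in extra_confs.items():
#       builder = builder.config(k, v)
for key, value in extra_confs.items():
    print(f"{key}={value}")
```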
% tar -zxf hadoop-3.3.4.tar.gz    # on the host
% docker build . -t mydelta
% docker run -it --rm -v `pwd`:/app mydelta bash
% python main.py                  # inside the container
...
23/02/25 02:13:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
...
Functionally it works well, but I would like to get rid of the warning.
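To rule out a completely broken binary, a small diagnostic I can run from inside the container checks whether the shared object dlopens from Python at all (my own helper sketch; the path is the one from the Dockerfile):

```python
import ctypes

def try_load(path: str) -> str:
    """Try to dlopen a shared library and describe the outcome."""
    try:
        ctypes.CDLL(path)
        return "loaded"
    except OSError as exc:
        return f"failed: {exc}"

# Inside the container this exercises the Hadoop native library:
print(try_load("/app/hadoop-3.3.4/lib/native/libhadoop.so.1.0.0"))
```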
% printenv | grep "HADOOP\|JAVA"
HADOOP_OPTS=/app/hadoop-3.3.4/lib/native
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
HADOOP_COMMON_HOME=/app/hadoop-3.3.4
HADOOP_CONF_DIR=/app/hadoop-3.3.4/etc/hadoop
HADOOP_HOME=/app/hadoop-3.3.4
HADOOP_HDFS_HOME=/app/hadoop-3.3.4
JAVA_LIBRARY_PATH=/app/hadoop-3.3.4/lib/native
HADOOP_COMMON_LIB_NATIVE_DIR=/app/hadoop-3.3.4/lib/native
HADOOP_PREFIX=/app/hadoop-3.3.4
% ls -l $HADOOP_COMMON_LIB_NATIVE_DIR
total 166800
drwxr-xr-x 6 root root 192 Jul 29 2022 examples
-rw-r--r-- 1 root root 1507316 Jul 29 2022 libhadoop.a
lrwxr-xr-x 1 root root 18 Jul 29 2022 libhadoop.so -> libhadoop.so.1.0.0
-rwxr-xr-x 1 root root 803040 Jul 29 2022 libhadoop.so.1.0.0
-rw-r--r-- 1 root root 1741256 Jul 29 2022 libhadooppipes.a
-rw-r--r-- 1 root root 754382 Jul 29 2022 libhadooputils.a
-rw-r--r-- 1 root root 551572 Jul 29 2022 libhdfs.a
lrwxr-xr-x 1 root root 16 Jul 29 2022 libhdfs.so -> libhdfs.so.0.0.0
-rwxr-xr-x 1 root root 333656 Jul 29 2022 libhdfs.so.0.0.0
-rw-r--r-- 1 root root 106649802 Jul 29 2022 libhdfspp.a
lrwxr-xr-x 1 root root 18 Jul 29 2022 libhdfspp.so -> libhdfspp.so.0.1.0
-rwxr-xr-x 1 root root 44450288 Jul 29 2022 libhdfspp.so.0.1.0
-rw-r--r-- 1 root root 10010090 Jul 29 2022 libnativetask.a
lrwxr-xr-x 1 root root 22 Jul 29 2022 libnativetask.so -> libnativetask.so.1.0.0
-rwxr-xr-x 1 root root 3980832 Jul 29 2022 libnativetask.so.1.0.0
% ldd --version
ldd (Debian GLIBC 2.31-13+deb11u5) 2.31
% ldd $HADOOP_COMMON_LIB_NATIVE_DIR/libhadoop.so.1.0.0
linux-vdso.so.1 (0x00007ffd405a5000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f36012ea000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f36012c8000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f36010f3000)
/lib64/ld-linux-x86-64.so.2 (0x00007f360151b000)
% file libhadoop.so.1.0.0
libhadoop.so.1.0.0: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=30ce002bb1ee648ac42090156300dbf4f5f9c1c4, with debug_info, not stripped
% objdump -f libhadoop.so.1.0.0
libhadoop.so.1.0.0: file format elf64-x86-64
architecture: i386:x86-64, flags 0x00000150:
HAS_SYMS, DYNAMIC, D_PAGED
start address 0x0000000000006bd0
Can you spot anything wrong with my libhadoop.so.1.0.0? What else can I try?