I checked this question and tried every answer except rebuilding Hadoop (which so far fails with endless errors). My hope is that the native binaries from the official Hadoop distribution will do, but I can't make them work.

Dockerfile:

FROM python:3.10
RUN apt update
RUN apt install -y default-jdk
RUN pip install pyspark==3.3.2 delta-spark==2.2.0
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ENV HADOOP_HOME=/app/hadoop-3.3.4
ENV HADOOP_OPTS="-Djava.library.path=/app/hadoop-3.3.4/lib/native"
ENV HADOOP_PREFIX=/app/hadoop-3.3.4
ENV HADOOP_COMMON_HOME=/app/hadoop-3.3.4
ENV HADOOP_COMMON_LIB_NATIVE_DIR=/app/hadoop-3.3.4/lib/native
ENV HADOOP_CONF_DIR=/app/hadoop-3.3.4/etc/hadoop
ENV HADOOP_HDFS_HOME=/app/hadoop-3.3.4
ENV LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native
ENV JAVA_LIBRARY_PATH=/app/hadoop-3.3.4/lib/native
WORKDIR /app
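
For reference, Hadoop ships a checknative utility that reports which native libraries it can actually load; with the layout above I believe it can be run inside the container as:

% $HADOOP_HOME/bin/hadoop checknative -a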

main.py:

from pyspark.sql.session import SparkSession
from delta import configure_spark_with_delta_pip

builder = SparkSession.builder.appName("my") \
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension') \
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')

spark = configure_spark_with_delta_pip(builder) \
    .getOrCreate()
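
As far as I understand, the native directory can also be handed straight to the driver JVM; with the spark-submit bundled in the pyspark wheel that would be something like:

% spark-submit --driver-java-options "-Djava.library.path=$HADOOP_HOME/lib/native" main.py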

Build and run:

% tar -zxf hadoop-3.3.4.tar.gz
% docker build . -t mydelta
% docker run -it --rm -v `pwd`:/app mydelta bash

% python main.py
...
23/02/25 02:13:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
...

Functionally it works well, but I would like to get rid of the warning.
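
To see why the loader falls back, I believe NativeCodeLoader can be put into DEBUG logging; with the log4j2 that Spark 3.3 uses, something like this (assuming $SPARK_HOME points at the pyspark wheel, and that conf/log4j2.properties exists or is created from the template):

% cat >> $SPARK_HOME/conf/log4j2.properties <<'EOF'
logger.native.name = org.apache.hadoop.util.NativeCodeLoader
logger.native.level = debug
EOF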

% printenv | grep "HADOOP\|JAVA"
HADOOP_OPTS=/app/hadoop-3.3.4/lib/native
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
HADOOP_COMMON_HOME=/app/hadoop-3.3.4
HADOOP_CONF_DIR=/app/hadoop-3.3.4/etc/hadoop
HADOOP_HOME=/app/hadoop-3.3.4
HADOOP_HDFS_HOME=/app/hadoop-3.3.4
JAVA_LIBRARY_PATH=/app/hadoop-3.3.4/lib/native
HADOOP_COMMON_LIB_NATIVE_DIR=/app/hadoop-3.3.4/lib/native
HADOOP_PREFIX=/app/hadoop-3.3.4
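
Note that the grep above does not match LD_LIBRARY_PATH, which the Dockerfile sets as well:

% echo $LD_LIBRARY_PATH
/usr/lib/hadoop/lib/native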

% ls -l $HADOOP_COMMON_LIB_NATIVE_DIR
total 166800
drwxr-xr-x 6 root root       192 Jul 29  2022 examples
-rw-r--r-- 1 root root   1507316 Jul 29  2022 libhadoop.a
lrwxr-xr-x 1 root root        18 Jul 29  2022 libhadoop.so -> libhadoop.so.1.0.0
-rwxr-xr-x 1 root root    803040 Jul 29  2022 libhadoop.so.1.0.0
-rw-r--r-- 1 root root   1741256 Jul 29  2022 libhadooppipes.a
-rw-r--r-- 1 root root    754382 Jul 29  2022 libhadooputils.a
-rw-r--r-- 1 root root    551572 Jul 29  2022 libhdfs.a
lrwxr-xr-x 1 root root        16 Jul 29  2022 libhdfs.so -> libhdfs.so.0.0.0
-rwxr-xr-x 1 root root    333656 Jul 29  2022 libhdfs.so.0.0.0
-rw-r--r-- 1 root root 106649802 Jul 29  2022 libhdfspp.a
lrwxr-xr-x 1 root root        18 Jul 29  2022 libhdfspp.so -> libhdfspp.so.0.1.0
-rwxr-xr-x 1 root root  44450288 Jul 29  2022 libhdfspp.so.0.1.0
-rw-r--r-- 1 root root  10010090 Jul 29  2022 libnativetask.a
lrwxr-xr-x 1 root root        22 Jul 29  2022 libnativetask.so -> libnativetask.so.1.0.0
-rwxr-xr-x 1 root root   3980832 Jul 29  2022 libnativetask.so.1.0.0
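
A direct dlopen of the library might surface an error that the Spark warning hides; a quick check via ctypes:

% python -c "import ctypes; ctypes.CDLL('$HADOOP_COMMON_LIB_NATIVE_DIR/libhadoop.so.1.0.0')"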

% ldd --version
ldd (Debian GLIBC 2.31-13+deb11u5) 2.31

% ldd $HADOOP_COMMON_LIB_NATIVE_DIR/libhadoop.so.1.0.0
    linux-vdso.so.1 (0x00007ffd405a5000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f36012ea000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f36012c8000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f36010f3000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f360151b000)

% file libhadoop.so.1.0.0
libhadoop.so.1.0.0: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=30ce002bb1ee648ac42090156300dbf4f5f9c1c4, with debug_info, not stripped

% objdump -f libhadoop.so.1.0.0
libhadoop.so.1.0.0:     file format elf64-x86-64
architecture: i386:x86-64, flags 0x00000150:
HAS_SYMS, DYNAMIC, D_PAGED
start address 0x0000000000006bd0
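
In case the image runs emulated (e.g. on an ARM host), an architecture mismatch between the container and the binary could, I suppose, be ruled out with:

% uname -m
% java -XshowSettings:properties -version 2>&1 | grep os.arch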

Can you spot anything wrong with my libhadoop.so.1.0.0? What else can I try?

greatvovan
  • Does your code work? You can ignore that warning – OneCricketeer Feb 27 '23 at 00:46
  • @OneCricketeer, yes, as mentioned in the question, the code works. If I understand correctly, there is a performance benefit to using native libraries, isn't there? – greatvovan Feb 27 '23 at 07:54
  • Sure, but the included ones are built for 32bit - https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/NativeLibraries.html – OneCricketeer Feb 27 '23 at 15:00
  • @OneCricketeer `file libhadoop.so.1.0.0` says: `libhadoop.so.1.0.0: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV)...` – greatvovan Feb 27 '23 at 18:25
  • Yes, I see that from your post... I have not done any of this in Docker, so I am not really sure. You may need to modify the `spark-env.sh` file – OneCricketeer Feb 27 '23 at 21:12

0 Answers