How to connect Spark to Kafka when both are running on Docker?

Question

I'm attempting to set up a quick POC on my Mac laptop (using Docker) to help demonstrate a streaming data ingestion flow using MySQL, Debezium, Kafka and Spark.

The MySQL / Debezium / Kafka environment is set up as follows:

version: '2'
services:
  zookeeper:
    image: quay.io/debezium/zookeeper:${DEBEZIUM_VERSION}
    ports:
     - 2181:2181
     - 2888:2888
     - 3888:3888
  kafka:
    image: quay.io/debezium/kafka:${DEBEZIUM_VERSION}
    ports:
     - 9092:9092
    links:
     - zookeeper
    environment:
     - ZOOKEEPER_CONNECT=zookeeper:2181
  mysql:
    image: quay.io/debezium/example-mysql:${DEBEZIUM_VERSION}
    ports:
     - 3306:3306
    environment:
     - MYSQL_ROOT_PASSWORD=debezium
     - MYSQL_USER=mysqluser
     - MYSQL_PASSWORD=mysqlpw
  connect:
    image: quay.io/debezium/connect:${DEBEZIUM_VERSION}
    ports:
     - 8083:8083
    links:
     - kafka
     - mysql
    environment:
     - BOOTSTRAP_SERVERS=kafka:9092
     - GROUP_ID=1
     - CONFIG_STORAGE_TOPIC=my_connect_configs
     - OFFSET_STORAGE_TOPIC=my_connect_offsets
     - STATUS_STORAGE_TOPIC=my_connect_statuses

This part is up and running. I'm able to connect to MySQL, change some values and see those changes flow through Debezium and Kafka.

I've also set up a stand-alone Spark 3.3 instance using a separate docker-compose as follows:

version: '3'
services:
  spark-master:
    image: docker.io/bitnami/spark:3.3
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '8080:8080'
  spark-worker:
    image: docker.io/bitnami/spark:3.3
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no

This part is also up and running and I am able to log onto the Spark UI.

My question is: What are the specific configuration or environment change do I need to make in order to be able to submit a Spark streaming job, which will read its data from the Kafka topic, given that both Spark and Kafka are running within their own Docker environments?

Why not simply put all containers in one Docker network? `docker network create`, then add networks to each image? https://docs.docker.com/compose/networking/ Or just use one compose file? — OneCricketeer, Jan 22 '23 at 14:42

score 1 · Accepted Answer · answered Jan 22 '23 at 11:18

I guess you can add host.docker.internal as an advertised address to Kafka and expose its port:

version: '2'
services:
  zookeeper:
    image: quay.io/debezium/zookeeper:${DEBEZIUM_VERSION}
    ports:
     - 2181:2181
     - 2888:2888
     - 3888:3888
  kafka:
    image: quay.io/debezium/kafka:${DEBEZIUM_VERSION}
    ports:
     - 9092:9092
    links:
     - zookeeper
    environment:
     - ZOOKEEPER_CONNECT=zookeeper:2181
     - ADVERTISED_HOST_NAME=host.docker.internal
    extra_hosts:                                                                
      - "host.docker.internal:host-gateway"

then add the host's IP to the Spark docker-compose file, and use it as the Kafka address.

version: '3'
services:
  spark-master:
    image: docker.io/bitnami/spark:3.3
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '8080:8080'
    extra_hosts:                                                                
      - "host.docker.internal:host-gateway"
  spark-worker:
    image: docker.io/bitnami/spark:3.3
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    extra_hosts:                                                                
      - "host.docker.internal:host-gateway"

While it may work, it's less optimal than a proper Docker bridge network as it has to traverse two/three network interfaces — OneCricketeer, Jan 22 '23 at 14:43

How to connect Spark to Kafka when both are running on Docker?

1 Answers1