
I have built a decision tree model using PySpark and I want to deploy that model in a Docker container. I am using Spark 1.6.0. The data is stored in Hive tables on my local machine. Is there a way to connect PySpark running in my Docker container to the Hive tables on my local machine?

The data in my Hive tables might get updated, so I don't want to mount a drive or simply copy the folder from my local machine into the container; I want to establish a connection between PySpark and the Hive tables themselves.
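
For reference, this is roughly how the model currently reads the Hive tables on my local machine (the database, table and column names below are just placeholders):

    # Sketch of the current local pipeline (Spark 1.6); "mydb.training_data",
    # "label" and the feature columns are placeholder names.
    from pyspark import SparkContext
    from pyspark.sql import HiveContext
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import DecisionTreeClassifier

    sc = SparkContext(appName="decision-tree-training")
    hc = HiveContext(sc)  # finds the metastore via hive-site.xml on the classpath

    # Read training data directly from a Hive table
    df = hc.table("mydb.training_data")

    # Assemble features and fit the decision tree
    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
    dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")
    model = dt.fit(assembler.transform(df))

The part I am unsure about is how HiveContext inside the container is supposed to reach the metastore on the host.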

Abhishek Sawant

1 Answer


If the data lives on your local machine, you can still run Hive in a Docker container and mount the local folder inside the Hive container.

With docker-compose you can then easily link the containers and access the Hive server through localhost.
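
A minimal docker-compose.yml along those lines could look like the sketch below; the image names, the warehouse path and the ports are placeholders for whatever Hive image and data location you actually use:

    version: "2"
    services:
      hive:
        image: your-hive-image            # placeholder: any image running HiveServer2 + metastore
        volumes:
          - /path/to/local/hive/warehouse:/user/hive/warehouse   # mount the local data folder
        ports:
          - "9083:9083"                   # Hive metastore (thrift)
          - "10000:10000"                 # HiveServer2
      pyspark:
        image: your-pyspark-model-image   # placeholder: the image with your decision tree code
        depends_on:
          - hive

Compose puts both services on the same network, so the PySpark container can reach the metastore at thrift://hive:9083.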

Another option is to use --network="host" when running your PySpark container, so it shares the host's network. That might not be what you want for security reasons, depending on what you are doing.
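
With host networking, the PySpark code in the container can talk to the host's metastore through localhost. Roughly like this (it assumes the metastore listens on the default port 9083; if setting the property programmatically doesn't take effect in your setup, put it in a hive-site.xml on the container's classpath instead):

    # Run the container with: docker run --network="host" your-pyspark-image
    # Inside it, point Spark 1.6 at the host's Hive metastore. 9083 is the
    # default metastore port; adjust it to your installation.
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="decision-tree-scoring")
    hc = HiveContext(sc)
    hc.setConf("hive.metastore.uris", "thrift://localhost:9083")

    hc.sql("SHOW TABLES").show()   # sanity check that the metastore is reachable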

See From inside of a Docker container, how do I connect to the localhost of the machine?

MrE
  • Hi, thanks for the answer. I tried with --network="host", but I still cannot access the local Hadoop files. I can ping the host from the container, but I cannot access the files. – Abhishek Sawant Jan 02 '19 at 15:59
  • And from your local machine you can access the files in Hive? If not, it's not a Docker issue. If you can, then it may be a permission issue with the user inside the container. – MrE Jan 02 '19 at 17:28