
I'm trying to use https://docs.docker.com/develop/develop-images/multistage-build/

I need openjdk 8 and the latest pandas on alpine (I'm installing spark / pyspark)

I initially tried using FROM openjdk:8-alpine and then installing python3 / pandas on top, but it turns out installing pandas on Alpine is rather hard, and you need the latest alpine docker image for it (see Installing pandas in docker Alpine)

So I need both FROM openjdk:8-alpine and FROM alpine:latest

My question is: how do I know which directories to copy from each stage?

If I do

FROM openjdk:8-alpine
FROM alpine:latest

I'll need to copy the Java 8-related files from openjdk:8-alpine

If I do the reverse:

FROM alpine:latest
# install pandas
FROM openjdk:8-alpine

I'll need to copy… (what?)

eugene
  • I think you need to install the JDK manually on the image; multi-stage will not be helpful in your case – LinPy Dec 10 '19 at 10:42
  • This should help: https://pythonspeed.com/articles/multi-stage-docker-python/ – Shubham Dec 10 '19 at 10:43
  • BTW, even if you are able to copy all the libraries of python/pandas in the `openjdk:8-alpine` how would you use them? They would still require `python` binary in the final image, right? – Shubham Dec 10 '19 at 10:47
  • Is there a specific reason you need both in the same image? Typical practice, if you have two application components as separate programs, is to run them in two separate containers and use a network call (like HTTP) between them. – David Maze Dec 10 '19 at 10:49
  • I need to run spark (which requires Java) and pyspark, which is Python. I guess I could separate them, but it's probably easier to set up a single Docker image. @DavidMaze – eugene Dec 10 '19 at 10:51
  • @eugene A general rule is to run only one application per container, even if they're on the same stack (e.g. both Java apps). It is possible to run multiple applications in one container, but that too has complexities. See https://docs.docker.com/config/containers/multi-service_container/ – Bernard Dec 10 '19 at 11:04
  • @Bernard thanks for the tip, I'll take that into account, but I doubt you can separate Java / Python in a spark / pyspark app. (Top Google hits for spark Dockerfile examples have Java and Python in one Dockerfile, e.g. https://github.com/Fokko/docker-pyspark/blob/master/Dockerfile) – eugene Dec 10 '19 at 11:10

1 Answer


When you use a multi-stage build, you typically create an artefact (a compiled app, for instance) during the first stage and copy it into a slimmer base image in the second stage. Everything else from the first stage is discarded when the final image is created.
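The typical pattern looks something like this (a generic Java sketch; the image tags and paths are illustrative, not specific to your setup):

```dockerfile
# Stage 1: build the app with the full toolchain
FROM maven:3-jdk-8 AS build
WORKDIR /src
COPY . .
RUN mvn -q package

# Stage 2: ship only the JRE plus the built artefact;
# everything else from the "build" stage is discarded
FROM openjdk:8-jre-alpine
COPY --from=build /src/target/app.jar /app.jar
ENTRYPOINT ["java", "-jar", "/app.jar"]
```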

From your comments, I understand that you need to start from an image that has both JDK 8 AND the latest alpine. A multi-stage build doesn't help here: you'd end up just copying the JDK into the alpine:latest final stage.
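To make that concrete, the copy would look roughly like this (the JAVA_HOME path is an assumption about where openjdk:8-alpine keeps the JDK; verify it against the actual image before relying on it):

```dockerfile
FROM openjdk:8-alpine AS jdk

FROM alpine:latest
# Assumed install location inside openjdk:8-alpine; check with
#   docker run --rm openjdk:8-alpine sh -c 'echo $JAVA_HOME'
COPY --from=jdk /usr/lib/jvm/java-1.8-openjdk /usr/lib/jvm/java-1.8-openjdk
ENV JAVA_HOME=/usr/lib/jvm/java-1.8-openjdk
ENV PATH="$JAVA_HOME/bin:$PATH"
```

Even then, the copied JDK still assumes the second stage provides compatible system libraries, which is one more reason to build a proper base image instead.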

I would instead copy the original openjdk:8-alpine Dockerfile, change its first line to FROM alpine:3.10, and build your own base image from it.

If you then need pyspark based on this image, copy the original pyspark Dockerfile and replace its first line (FROM openjdk:8) with the base image you created previously.
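A minimal sketch of such a base image (untested; `openjdk8` is the Alpine package name, and the `JAVA_HOME` path follows that package's layout):

```dockerfile
# Your own "JDK 8 on recent Alpine" base image
FROM alpine:3.10
RUN apk add --no-cache openjdk8
ENV JAVA_HOME=/usr/lib/jvm/java-1.8-openjdk
ENV PATH="$JAVA_HOME/bin:$PATH"
```

Build and tag it once, e.g. `docker build -t my-alpine-jdk8 .` (the tag name is your choice), then start the pyspark Dockerfile with `FROM my-alpine-jdk8` and install python3 / pandas on top as usual.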

Bernard