
Our goal is to build a 'portable' postgres database we can deploy to arbitrary environments with data pre-loaded. Our current (possibly ill-advised) approach is to build a postgres docker container with the data baked into the image. We want the data loaded at build time, not at run time, so the default initialization approach provided by the postgres docker images won't work.

Our basic approach is to use the base postgres image, copy in a pg_dump SQL file, run it via the provided initialization mechanism, wait for the server to finish loading the data, shut down the server, and then expose postgres as the command for the container. The Dockerfile is:

FROM docker.werally.in/postgres:11-alpine

ENV POSTGRES_USER 'postgres'
RUN echo "HELLO"
RUN apk --no-cache add bash \
    curl \
    vim


RUN mkdir -p /docker-entrypoint-initdb.d

# This script initializes the data using data.dump as source data
COPY ./initialize-databases.sh /docker-entrypoint-initdb.d
COPY ./data.dump /docker-entrypoint-initdb.d

# This script just runs the default docker-entrypoint for the PG container, which sets 
# up the postgres server and runs initialize-databases.sh
COPY ./docker-entrypoint-wrapper.sh /usr/local/bin/docker-entrypoint-wrapper.sh
# Run the wrapper script; at the end we run ls on /var/lib/postgresql/data, and it
# has a whole DB's worth of data
RUN chmod +x /usr/local/bin/docker-entrypoint-wrapper.sh && /usr/local/bin/docker-entrypoint-wrapper.sh && ls /var/lib/postgresql/data

# Set the user to postgres so we can actually run the server
USER postgres

# Now we run ls on the same directory and there's no data WTF?
RUN ls /var/lib/postgresql/data

CMD ["postgres"]

First things first: what the heck is going on? Why does a gig or so of data vanish from that directory? I'm absolutely baffled, but I also have a pretty hazy understanding of the details of docker layers.

Second things second, what are alternative approaches here? The obvious one is using filesystem-level data backups, but that's a headache since we have no ability to take filesystem snapshots of the relevant source databases, and so would need to generate the filesystem snapshots from SQL snapshots, which I would like to avoid. But someone else has to have solved the problem of 'I need a copy of postgres database A that I can deploy to environments B, C, D and E without waiting for a 10-minute DB restore each time'.

Nick Bailey
  • Start the postgresql image, then on the container install whatever you want, then `commit` the container to an image [See docs](https://docs.docker.com/engine/reference/commandline/commit/) – Max Mar 17 '21 at 23:54
  • `docker commit` is almost never a good workflow, and I'd recommend never running it at all. For the standard Docker Hub database images, [`docker commit` doesn't persist data](https://stackoverflow.com/questions/27377876/docker-postgres-with-initial-data-is-not-persisted-over-commits) so it doesn't help this question either. – David Maze Mar 18 '21 at 11:04
  • You might read through the answers to [docker postgres with initial data is not persisted over commits](https://stackoverflow.com/questions/27377876/docker-postgres-with-initial-data-is-not-persisted-over-commits), which have a technical explanation and a workaround. [This answer](https://stackoverflow.com/questions/29600369/starting-and-populating-a-postgres-container-in-docker/54946350#54946350) also suggests a binary-format database dump might be much faster than a text-format one, and then can be wired into the standard `/docker-entrypoint-initdb.d` setup. – David Maze Mar 18 '21 at 11:13
  • Yeah. A binary-format database dump would likely be faster. We'll clock it for the large datasets and see if it's feasible. I think our current conclusion is 'this is the wrong way to try to solve this problem; try to solve it a different way.' Still confused about why the data vanishes, though. – Nick Bailey Mar 18 '21 at 16:04
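
For reference, the binary-format variant suggested in the comments would regenerate the dump with pg_dump -Fc and swap the psql call in the init script for pg_restore, roughly like this (a sketch, not tested against our data; it relies on the POSTGRES_USER/POSTGRES_DB variables the image's entrypoint exports):

#!/usr/bin/env bash
set -e

# Restore a custom-format (pg_dump -Fc) dump instead of replaying plain SQL.
# -j runs the restore in parallel, which is only possible with custom or
# directory format dumps and is where most of the speedup comes from.
pg_restore -j "$(nproc)" --no-owner -U "$POSTGRES_USER" -d "$POSTGRES_DB" \
    /docker-entrypoint-initdb.d/data.dump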

0 Answers