Our goal is to build a 'portable' postgres database we can deploy to arbitrary environments with data pre-loaded. Our current (possibly ill advised) approach is to build a postgres docker container with the data pre-loaded into the container. We want the data loaded at build time not at run time, so the default initialization approach provided by the postgres docker containers won't work.
Our basic approach is to use the base postgres image copy in a pg_dump sql file, run it using the provided initialization mechanism, wait for the server to finish loading the data, shutdown the server and then expose postgres as the command for the container. The dockerfile is
FROM docker.werally.in/postgres:11-alpine
ENV POSTGRES_USER 'postgres'
RUN echo "HELLO"
RUN apk --no-cache add bash \
curl \
vim
RUN mkdir -p /docker-entrypoint-initdb.d
# This script initializes the data using data.dump as source data
COPY ./initialize-databases.sh /docker-entrypoint-initdb.d
COPY ./data.dump /docker-entrypoint-initdb.d
# This script just runs the default docker-entrypoint for the PG container, which sets
# up the postgres server and runs initialize-databases.sh
COPY ./docker-entrypoint-wrapper.sh /usr/local/bin/docker-entrypoint-wrapper.sh
# Run the wrapper script, at the end we run ls on /var/lib/postgresql/data, and it
# has a whole DBs worth of data
RUN chmod +x /usr/local/bin/docker-entrypoint-wrapper.sh && /usr/local/bin/docker-entrypoint-wrapper.sh && ls /var/lib/postgresql/data
# Set the user to postgres so we can actually run the server
USER postgres
# Now we run ls on the same directory and there's no data WTF?
RUN ls /var/lib/postgresql/data
CMD ["postgres"]
First things first, what the heck is going on. Why does a gig or so of data vanish from that directory. I'm absolutely baffled, but I also have a pretty hazy understanding of the details of docker layers.
Second things second, what are alternative approaches here? The obvious one is using filesystem level data backups, but that's a headache since we have no ability to take filesystem snapshots from the relevant source databases, and so would need to generate the filesystem snapshots from SQL snapshots which I would like. to avoid. But someone else has to have solved the problem of 'I need a copy of postgres database A that I can deploy to environments B, C, D and E without waiting for a 10 minute DB restore each time'