I have personally used s3fs to solve this problem in the past. Using S3 as a mounted filesystem has some caveats you would be wise to familiarize yourself with (you are treating something that is not a filesystem as if it were one, a classic leaky-abstraction problem), but if your workflow is relatively simple and free of potential race conditions, you should be able to do it with some confidence, especially now that S3 provides strong read-after-write consistency automatically for all applications (as of December 2020).
To answer your other question:
I could use s3fs-fuse but I was told that I won't be able to install or store any of the files from S3 on EC2 instances on AWS Batch instances, which can then be mounted in docker. Is there a way to do this by including some code in the AMI that will copy files from s3 to instance?
If you use s3fs to mount your S3 bucket as a filesystem within Docker, you don't need to worry about copying files from S3 to the instance; indeed, the whole point of using s3fs is that you can access all your files in S3 from the container without having to move them off of S3.
Say, for instance, you mount your S3 bucket s3://my-test-bucket to /data in the container. You can then run your program like my-executable --input /data/my-s3-file --output /data/my-s3-output, as if the input file were right there on the local filesystem. When it's done, the output file will be on S3 at s3://my-test-bucket/my-s3-output. This can simplify your workflow and cut down on glue code quite a bit.
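To make that concrete, here is roughly what a session inside the container looks like once the mount is in place (my-executable and the file names are just the placeholders from the example above):

# inside the container, after s3fs has mounted the bucket at /data
ls /data                      # lists the objects in s3://my-test-bucket
my-executable --input /data/my-s3-file --output /data/my-s3-output
ls /data/my-s3-output         # the output is now an object in S3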
My Dockerfile for my s3fs AWS Batch container looks like this:
FROM ubuntu:18.04

# build dependencies for compiling s3fs-fuse from source
RUN apt-get -y update && apt-get -y install curl wget build-essential automake libcurl4-openssl-dev libxml2-dev pkg-config libssl-dev libfuse-dev parallel

# download, build, and install s3fs-fuse v1.86, then clean up the build tree
# (note the "cd .." before rm; without it the cleanup silently removes nothing,
# since the shell is still inside the source directory)
RUN wget https://github.com/s3fs-fuse/s3fs-fuse/archive/v1.86.tar.gz && \
    tar -xzvf v1.86.tar.gz && \
    cd s3fs-fuse-1.86 && \
    ./autogen.sh && \
    ./configure --prefix=/usr && \
    make && \
    make install && \
    cd .. && rm -rf s3fs-fuse-1.86 v1.86.tar.gz

# mount point where the S3 bucket will be attached at runtime
RUN mkdir /data

COPY entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
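One caveat for AWS Batch: FUSE mounts inside a container require elevated privileges (the SYS_ADMIN capability and access to /dev/fuse), and the simplest way to grant them in Batch is to mark the job definition as privileged. Here is a minimal sketch of registering such a job definition with the AWS CLI; the job definition name, image URI, and resource sizes are hypothetical:

# "privileged": true grants the capabilities the container needs for FUSE mounts
aws batch register-job-definition \
    --job-definition-name my-s3fs-job \
    --type container \
    --container-properties '{
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-s3fs-image",
        "vcpus": 1,
        "memory": 2048,
        "privileged": true,
        "command": ["my-executable", "--input", "/data/my-s3-file", "--output", "/data/my-s3-output"]
    }'

The job role (or instance role) also needs read/write permissions on the bucket, since the -o ecs option below makes s3fs pick up its credentials from the container's IAM role.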
entrypoint.sh is a convenience for always running the s3fs mount before the main program (this breaks the one-process-per-container paradigm, but I don't think that's cause for major concern here). It looks like this:
#!/bin/bash
# abort if the mount fails rather than running against an empty /data
set -e
bucket=my-bucket
# -o ecs pulls credentials from the ECS task role, which is how AWS Batch jobs get their IAM permissions
s3fs ${bucket} /data -o ecs
echo "Mounted ${bucket} to /data"
# hand off to whatever command the container was invoked with
exec "$@"
See also this related answer: https://stackoverflow.com/a/60556131/1583239