In my company, we have a continuous learning process. Every 5-10 minutes we create a new model in HDFS. A model is a folder of several files:
- model, ~1 GB (binary file)
- model metadata, ~1 KB (text file)
- model features, ~1 KB (CSV file) ...
On the other hand, we have hundreds of model serving instances that need to download the model to the local filesystem every 5-10 minutes and serve from it. Currently we use WebHDFS from our service (the Java FileSystem client), but this probably puts load on our Hadoop cluster, since it redirects every read to the concrete DataNodes.
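For reference, a minimal sketch of our current approach, assuming a hypothetical NameNode address and model paths (everything below is illustrative, not our real code):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class ModelDownloader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // webhdfs:// redirects each file read to the DataNode that holds the
        // blocks, which is why every serving instance ends up talking to the
        // cluster directly.
        FileSystem fs = FileSystem.get(URI.create("webhdfs://namenode:9870"), conf);

        Path remoteModelDir = new Path("/models/latest");            // hypothetical layout
        Path localModelDir  = new Path("file:///var/serving/model"); // hypothetical layout

        // copyToLocalFile walks the folder and pulls every file
        // (model binary, metadata, features CSV).
        fs.copyToLocalFile(false, remoteModelDir, localModelDir, true);
    }
}
```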
We are considering using the HttpFS service instead. Does it have any caching capability, so that the first request loads a folder into the service's memory and subsequent requests are served from the already downloaded result?
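To make the caching requirement concrete, here is a client-side sketch of the idea, with a hypothetical modification-time check (class name, paths, and the address are all made up). Ideally something equivalent would happen inside HttpFS itself, so that hundreds of instances requesting the same model don't each hit the cluster:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class CachedModelDownloader {
    private long lastSeenModificationTime = -1L;

    // Re-download the model folder only when its modification time on HDFS
    // has advanced; otherwise keep serving from the local copy.
    public void refreshIfChanged() throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("webhdfs://namenode:9870"), conf);

        Path remoteModelDir = new Path("/models/latest");
        FileStatus status = fs.getFileStatus(remoteModelDir);

        if (status.getModificationTime() > lastSeenModificationTime) {
            fs.copyToLocalFile(false, remoteModelDir,
                    new Path("file:///var/serving/model"), true);
            lastSeenModificationTime = status.getModificationTime();
        }
    }
}
```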
What other technology or solution could be used for such a use case?