What would be a more efficient way to handle this?
I'll answer in three parts:
- Options to reduce memory consumption
- Minimalistic architecture for serverless computing
- How to get there
(I) Reducing Memory Consumption
Some handle large loads of data
If you find the scripts use more memory than you expect, the only way to reduce the memory requirements is to
- understand which parts of the scripts drive memory consumption
- refactor the scripts to use less memory
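For the first point, the standard library's tracemalloc module is one way to see where memory goes. The following is a minimal sketch, not a full profiling setup; `build_table` is a made-up stand-in for a memory-hungry step in one of your scripts.

```python
import tracemalloc


def build_table(n):
    # Hypothetical stand-in for a memory-hungry step in a script.
    return [str(i) * 10 for i in range(n)]


tracemalloc.start()
before = tracemalloc.take_snapshot()

table = build_table(100_000)

after = tracemalloc.take_snapshot()
current, peak = tracemalloc.get_traced_memory()  # bytes, since start()
tracemalloc.stop()

# Rank source lines by how much memory they allocated between the snapshots.
top_stats = after.compare_to(before, "lineno")
for stat in top_stats[:3]:
    print(stat)
print(f"current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
```

The lines at the top of that report are the first candidates for refactoring.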
Typical issues that drive memory consumption are:
- using the wrong data structure - e.g. if you have numerical data it is usually better to load the data into a numpy array as opposed to a Python list. If you create a lot of objects of custom classes, it can help to use __slots__.
- loading too much data into memory at once - e.g. if the processing can be split into several parts independent of each other, it may be more efficient to only load as much data as one part needs, then use a loop to process all the parts.
- holding object references that are no longer needed - e.g. in the course of processing you create objects to represent or process some part of the data. If the script keeps a reference to such an object, it won't get released until the end of the program. One way around this is to use weak references, another is to use del to dereference objects explicitly. Sometimes it also helps to call the garbage collector.
- using an offline algorithm when there is an online version (for machine learning) - e.g. some of scikit-learn's algorithms have a counterpart for incremental learning, such as LinearRegression => SGDRegressor or LogisticRegression => SGDClassifier.
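To make the first three points concrete, here is a small standard-library sketch. All class and function names (`PointDict`, `PointSlots`, `iter_chunks`, `chunked_sum_of_squares`, `Batch`) are made up for illustration, and the actual savings depend on your data, so measure before and after.

```python
import gc
import weakref


class PointDict:
    """Regular class: every instance carries its own __dict__."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


class PointSlots:
    """Same data, but __slots__ removes the per-instance __dict__."""

    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


def iter_chunks(values, chunk_size):
    """Yield lists of at most chunk_size items from any iterable."""
    chunk = []
    for value in values:
        chunk.append(value)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def chunked_sum_of_squares(values, chunk_size=1000):
    """Process the data part by part; only one chunk is held in memory."""
    total = 0
    for chunk in iter_chunks(values, chunk_size):
        total += sum(v * v for v in chunk)
    return total


class Batch:
    """Hypothetical intermediate object created during processing."""

    def __init__(self, rows):
        self.rows = rows


batch = Batch(list(range(1000)))
probe = weakref.ref(batch)  # a weak reference does not keep the object alive
del batch                   # drop the only strong reference
gc.collect()                # mainly matters for reference cycles
released = probe() is None

print(hasattr(PointDict(1, 2), "__dict__"), hasattr(PointSlots(1, 2), "__dict__"))
print(chunked_sum_of_squares(range(10), chunk_size=3))
print(released)
```

In a real script the chunks would typically come from a file or database cursor rather than an in-memory range, so that the full data set never has to be loaded at once.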
some do minor data science tasks
Some algorithms do require large amounts of memory. If using an online algorithm for incremental learning is not an option, the next best strategy is to use a service that only charges for the actual computation time/memory usage. That's what is typically referred to as serverless computing - you don't need to manage the servers (droplets) yourself.
The good news is that in principle the provider you use, Digital Ocean, provides a model that only charges for resources actually used. However, this is not really serverless: it is still your task to create, start, stop and delete the droplets to actually benefit. Unless this process is fully automated, the fun factor is a bit low ;-)
(II) Minimalistic Architecture for Serverless Computing
a full dedicated 8GB / 16B droplet for my new script sounds a bit inefficient and expensive
Since your scripts run only occasionally / on a schedule, your droplet does not need to run or even exist all the time. So you could set this up the following way:
- Create a scheduling droplet. This can be of a small size. Its only purpose is to run a scheduler and to create a new droplet when a script is due, then submit the task for execution on this new worker droplet. The worker droplet can be of the specific size to accommodate the script, i.e. every script can have a droplet of whatever size it requires.
- Create a generic worker. This is the program that runs upon creation of a new droplet by the scheduler. It receives as input the URL to the git repository where the actual script to be run is stored, and a location to store results. It then checks out the code from the repository, runs the script and stores the results.
- Once the script has finished, the scheduler deletes the worker droplet.
With this approach there are still fully dedicated droplets for each script, but they only cost money while the script runs.
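The create / run / delete cycle can be sketched roughly as follows. This is only an outline under assumptions: the `Provider` interface, `FakeProvider`, `Job` and `run_on_fresh_worker` are hypothetical names, and the fake provider just records calls in memory. A real implementation would wrap whichever cloud API client you choose.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Job:
    name: str
    repo_url: str      # git repository holding the script
    results_path: str  # where the worker should store results
    size: str          # droplet size this particular script needs


class Provider(Protocol):
    """Hypothetical minimal cloud interface, not a real library API."""

    def create_worker(self, name: str, size: str) -> str: ...
    def run_job(self, worker_id: str, repo_url: str, results_path: str) -> int: ...
    def delete_worker(self, worker_id: str) -> None: ...


class FakeProvider:
    """In-memory stand-in so the sketch runs without a cloud account."""

    def __init__(self):
        self.workers = {}
        self.log = []
        self._next_id = 0

    def create_worker(self, name, size):
        worker_id = f"worker-{self._next_id}"
        self._next_id += 1
        self.workers[worker_id] = size
        self.log.append(("create", name, size))
        return worker_id

    def run_job(self, worker_id, repo_url, results_path):
        # A real worker would check out repo_url, run the script and
        # write its output to results_path.
        self.log.append(("run", repo_url, results_path))
        return 0  # pretend exit status

    def delete_worker(self, worker_id):
        del self.workers[worker_id]
        self.log.append(("delete", worker_id))


def run_on_fresh_worker(provider: Provider, job: Job) -> int:
    """Create a worker sized for the job, run it, and always clean up."""
    worker_id = provider.create_worker(job.name, job.size)
    try:
        return provider.run_job(worker_id, job.repo_url, job.results_path)
    finally:
        # Delete the worker even if the job fails, so billing stops.
        provider.delete_worker(worker_id)


provider = FakeProvider()
job = Job("nightly-report", "https://example.com/repo.git", "results/nightly", "8gb")
status = run_on_fresh_worker(provider, job)
print(status, [entry[0] for entry in provider.log])
```

The try/finally is the important design choice here: a worker that outlives a crashed job keeps costing money, which defeats the purpose of the setup.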
(III) How to get there
One option is to build an architecture as described above, which would essentially be an implementation of a minimalistic architecture for serverless computing. There are several Python libraries to interact with the Digital Ocean API. You could also use libcloud as a generic multi-provider cloud API to make it easier to switch providers later on.
Perhaps the better alternative, before building this yourself, is to evaluate one of the existing open source serverless options. An extensive curated list is provided by the good fellows at awesome-serverless. Note that at the time of writing, many of the open source projects are still in their early stages; the more mature options are commercial.
As always with engineering decisions, there is a trade-off between the time/cost required to build or host yourself vs. the cost of using a readily available commercial service. Ultimately that's a decision only you can make.