The challenge is to run a set of data processing and data science scripts that consume more memory than expected.

Here are my requirements:

  • Running 10-15 Python 3.5 scripts scheduled via cron
  • Each of these 10-15 scripts takes somewhere between 10 seconds and 20 minutes to complete
  • They run at different hours of the day; some run every 10 minutes while others run once a day
  • Each script logs what it has done so that I can take a look at it later if something goes wrong
  • Some of the scripts send e-mails to me and my teammates
  • None of the scripts has an HTTP/web server component; they all run on cron schedules and are not user-facing
  • All the scripts' code comes from my GitHub repository; when a script wakes up, it first does a git pull origin master and then starts executing. That means pushing to master causes all scripts to run the latest version.

Here is what I currently have:

  • Currently I am using 3 Digital Ocean servers (droplets) for these scripts
  • Some of the scripts require a huge amount of memory (I get segmentation faults on droplets with less than 4GB of memory)
  • I want to introduce a new script that requires even more memory (the new script currently faults on a 4GB droplet)
  • The setup of the droplets is relatively easy (thanks to Python venv), but not to the point of executing a single command to spin up a new droplet and set it up

Having a full dedicated 8GB / 16GB droplet for my new script sounds a bit inefficient and expensive.

What would be a more efficient way to handle this?

  • Could you share some details on what your scripts do? It is not typical for Python programs "as such" to take this amount of memory, so it's related to the specifics of your program. Sounds like you load a lot of data, perhaps in a suboptimal way or data structure. Knowing more will help to answer your question. – miraculixx Feb 07 '19 at 17:55
  • Some handle large loads of data while some do minor data science tasks. When I run the scripts on my own Mac, I can see that they use around 2.5GB of memory. I am not sure why a 4GB droplet can't handle 2GB memory consumption, but that is probably another question. In any case, it would be hard to re-write those scripts now. – Guven Feb 07 '19 at 18:27
  • OK, I took a stab at answering your somewhat generic question, which however I find quite interesting since it has a few nice angles to it. I also took the liberty to change the title of your question to make it a bit more specific. Hope this helps. – miraculixx Feb 08 '19 at 16:25

1 Answer


What would be a more efficient way to handle this?

I'll answer in three parts:

  1. Options to reduce memory consumption
  2. Minimalistic architecture for serverless computing
  3. How to get there

(I) Reducing Memory Consumption

Some handle large loads of data

If you find the scripts use more memory than you expect, the only way to reduce the memory requirements is to

  1. understand which parts of the scripts drive memory consumption (see the profiling sketch right after this list)
  2. refactor the scripts to use less memory
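
For step 1, a quick way to see which parts of a script allocate the most memory is the standard library's tracemalloc module (available since Python 3.4). The following is only a minimal sketch; run_the_expensive_part is a hypothetical stand-in for your script's main work:

    import tracemalloc

    def run_the_expensive_part():
        # hypothetical stand-in for the script's actual work
        return [list(range(1000)) for _ in range(1000)]

    tracemalloc.start()
    data = run_the_expensive_part()

    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.statistics("lineno")[:10]:
        print(stat)  # the 10 source lines that allocated the most memory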

Typical issues that drive memory consumption are:

  • using the wrong data structure - e.g. if you have numerical data it is usually better to load the data into a numpy array as opposed to a plain Python list. If you create a lot of objects of custom classes, it can help to use __slots__

  • loading too much data into memory at once - e.g. if the processing can be split into several parts independent of each other, it may be more efficient to only load as much data as one part needs, then use a loop to process all the parts.

  • holding object references that are no longer needed - e.g. in the course of processing you create objects to represent or process some part of the data. If the script keeps a reference to such an object, it won't get released until the end of the program. One way around this is to use weak references, another is to use del to dereference objects explicitly. Sometimes it also helps to call the garbage collector.

  • using an offline algorithm when there is an online version (for machine learning) - e.g. some of scikit-learn's algorithms provide a version for incremental learning, such as LinearRegression => SGDRegressor or LogisticRegression => SGDClassifier (see the sketch right after this list)
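
To illustrate the chunked-loading and incremental-learning points, here is a rough sketch that reads a large CSV in pieces and trains scikit-learn's SGDRegressor one chunk at a time; the file name, column names and chunk size are made up:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import SGDRegressor

    model = SGDRegressor()

    # read the CSV in 100000-row chunks so only one chunk is in memory at a time
    for chunk in pd.read_csv("measurements.csv", chunksize=100000):
        X = chunk[["feature_a", "feature_b"]].values.astype(np.float32)  # compact numpy array
        y = chunk["target"].values.astype(np.float32)
        model.partial_fit(X, y)  # online learning, one chunk at a time
        del X, y                 # drop references so the memory can be freed right away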

some do minor data science tasks

Some algorithms do require large amounts of memory. If using an online algorithm for incremental learning is not an option, the next best strategy is to use a service that only charges for the actual computation time/memory usage. That's what is typically referred to as serverless computing - you don't need to manage the servers (droplets) yourself.

The good news is that, in principle, the provider you use, Digital Ocean, provides a model that only charges for resources actually used. However, this is not really serverless: it is still your task to create, start, stop and delete the droplets to actually benefit. Unless this process is fully automated, the fun factor is a bit low ;-)

(II) Minimalistic Architecture for Serverless Computing

a full dedicated 8GB / 16GB droplet for my new script sounds a bit inefficient and expensive

Since your scripts run only occasionally / on a schedule, your droplet does not need to run or even exist all the time. So you could set this up the following way:

  1. Create a scheduling droplet. This can be of a small size. Its only purpose is to run a scheduler and to create a new droplet when a script is due, then submit the task for execution on this new worker droplet. The worker droplet can be of the specific size to accommodate the script, i.e. every script can have a droplet of whatever size it requires.

  2. Create a generic worker. This is the program that runs upon creation of a new droplet by the scheduler. It receives as input the URL of the git repository where the actual script to be run is stored, and a location to store results. It then checks out the code from the repository, runs the script and stores the results (a rough sketch follows below).

  3. Once the script has finished, the scheduler deletes the worker droplet.

With this approach there are still fully dedicated droplets for each script, but they only cost money while the script runs.
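
For illustration, the generic worker from step 2 can be little more than a thin wrapper around git and the Python interpreter. In this sketch the repository URL, the script path and the results directory are hypothetical command-line arguments:

    import os
    import subprocess
    import sys

    def main(repo_url, script_path, results_dir):
        # 1. fetch the latest code for the script that is due
        subprocess.check_call(["git", "clone", "--depth", "1", repo_url, "job"])
        # 2. run the script; its stdout/stderr become the job log
        with open(os.path.join(results_dir, "job.log"), "w") as log:
            subprocess.check_call([sys.executable, os.path.join("job", script_path)],
                                  stdout=log, stderr=log)
        # 3. the scheduler polls for completion and then deletes this droplet

    if __name__ == "__main__":
        main(sys.argv[1], sys.argv[2], sys.argv[3])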

(III) How to get there

One option is to build an architecture as described above, which would essentially be an implementation of a minimalistic architecture for serverless computing. There are several Python libraries to interact with the Digital Ocean API. You could also use libcloud as a generic multi-provider cloud API to make it easy(ier) to switch providers later on.
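
For example, with the python-digitalocean package the scheduler can create and later destroy a worker droplet in a few lines. This is only a sketch: the droplet name, region, image and size slugs are examples, and the API token is assumed to come from an environment variable:

    import os
    import digitalocean

    token = os.environ["DO_TOKEN"]

    # create a worker droplet sized for the specific script
    worker = digitalocean.Droplet(
        token=token,
        name="worker-heavy-script",
        region="fra1",
        image="ubuntu-18-04-x64",
        size_slug="s-4vcpu-8gb",               # pay for this size only while the script runs
        user_data=open("bootstrap.sh").read()  # cloud-init script that starts the generic worker
    )
    worker.create()

    # ... later, once the script has reported completion ...
    worker.destroy()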

Perhaps the better alternative before building it yourself is to evaluate one of the existing open source serverless options. An extensive curated list is provided by the good fellows at awesome-serverless. Note that at the time of writing, many of the open source projects are still in their early stages; the more mature options are commercial.

As always with engineering decisions, there is a trade-off between the time/cost required to build or host yourself vs. the cost of using a readily-available commercial service. Ultimately that's a decision only you can take.

miraculixx