
I'm developing a Django app. The use-case scenario is this:

50 users, each of whom can store up to 300 time series, and each time series has around 7,000 rows.

Each user can ask at any time to retrieve all of their 300 time series and to perform some advanced data analysis on the last N rows of each. The analysis cannot be done in SQL, only in Pandas, where it doesn't take much time... but retrieving 300,000 rows into separate dataframes does!

Users can also ask for the results of some analyses that can be performed in SQL (like aggregation + sum by date), which is considerably faster (to the point where I wouldn't be writing this post if that were all of it).

After some browsing and thinking, I've figured that storing time series in SQL is not a good solution (read here).

The ideal deployment architecture looks like this (each bucket is a separate server!):

[deployment architecture diagram]

Problem: time series in SQL are too slow to retrieve in a multi-user app.

Researched solutions (from this article) include PyStore and Arctic.

Here are some problems with them:

1) Although these solutions are massively faster at pulling a time series of millions of rows into a single dataframe, I might need to pull around 500,000 rows into 300 different dataframes. Would that still be as fast?

This is the current db structure I'm using:

from django.db import models

class TimeSerie(models.Model):
    # ... user foreign key and other fields ...
    ...

class TimeSerieRow(models.Model):
    date = models.DateField()
    timeserie = models.ForeignKey(TimeSerie, on_delete=models.CASCADE)
    number = ...          # numeric column
    another_number = ...  # numeric column
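
(As an aside, if the rows stay in SQL, a composite index matching the query pattern shown below is worth having. A minimal sketch, assuming Django 1.11+ and the models above:)

from django.db import models

class TimeSerieRow(models.Model):
    date = models.DateField()
    timeserie = models.ForeignKey("TimeSerie", on_delete=models.CASCADE)

    class Meta:
        # Matches the access pattern: filter by timeserie, then filter/order by date.
        indexes = [models.Index(fields=["timeserie", "date"])]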

And this is the bottleneck in my application:

import pandas as pd

# One query (and one dataframe) per time series -- this loop is the bottleneck.
for t in TimeSerie.objects.filter(user=user):
    q = TimeSerieRow.objects.filter(timeserie=t).order_by("date")
    q = q.filter( ... time filters ... )
    df = pd.DataFrame(q.values())
    # ... analysis on df
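
(As a point of comparison, the 300 per-series queries can be collapsed into a single query, with the split into per-series dataframes done in pandas. A rough sketch, assuming the models above and a hypothetical cutoff_date standing in for the time filters; this cuts the per-query overhead, but the cost of moving the rows out of SQL remains.)

import pandas as pd

# Single query for all of the user's rows, split afterwards in pandas.
# cutoff_date is a hypothetical stand-in for the original time filters.
rows = (
    TimeSerieRow.objects
    .filter(timeserie__user=user, date__gte=cutoff_date)
    .order_by("timeserie_id", "date")
    .values("timeserie_id", "date", "number", "another_number")
)
df_all = pd.DataFrame(list(rows))

for timeserie_id, df in df_all.groupby("timeserie_id"):
    ...  # analysis on df, one group per time series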

2) Even if PyStore or Arctic can do that faster, it would mean losing the ability to decouple my db from the Django instances: I'd use the resources of one machine much better, but I'd be stuck with a single machine and unable to scale (or I'd have to use as many separate databases as machines). Can PyStore/Arctic avoid this and provide an adapter for remote storage?

Is there a Python/Linux solution that can solve this problem? Which architecture can I use to overcome it? Should I drop the scalability of my app and/or accept that for every N new users I'll have to spawn a separate database?

Saturnix
  • Why can't you decouple your "db" if you use Arctic or PyStore? Both are storages; they can run on instances separate from your Django app. And I think the article you reference is the best answer to your question. You should try them and see which one best fits your requirements. – dirkgroten Sep 04 '19 at 14:08
  • Yes, they can run on separate instances, but how would I transfer data between the two? I'd have to write a layer between them, and it must be as efficient as PyStore in r/w performance, which defeats the purpose of having a pre-made library in the first place. If, for instance, the storage machine reads with PyStore and sends JSON, then I'd have to parse JSON in Django, adding massive overhead... I'd be even slower than right now with SQL+Pandas. I suppose PyStore/Arctic *requires* running on the machine that uses the data in order to benefit from its performance: please correct me if I'm wrong – Saturnix Sep 04 '19 at 14:11
  • And don't forget other options mentioned, like InfluxDB. – dirkgroten Sep 04 '19 at 14:11
  • No it doesn't. Arctic uses mongoDB to store the data and that can be anywhere. pyStore uses a file-like system to store the data and that could be anywhere where it can access files. On AWS, you could use S3 or an EFS volume shared by multiple EC2 instances that run pystore. – dirkgroten Sep 04 '19 at 14:20
  • How would you transfer a Pandas dataframe between one machine and another? Wouldn't you have to serialize the data (if yes: which format)? Or does pyStore have serialization/deserialization support? I figured serializing/deserializing would be a massive overhead, but if that's not the case this could be the solution... I'll also look into InfluxDB. Thanks for the tips! – Saturnix Sep 04 '19 at 14:22
  • Oh, I get it! So you say the file system or mongoDB should be shared, and the Python library on the same machine as Django. Makes sense! – Saturnix Sep 04 '19 at 14:25
  • Yes, shared db/file system. Same as with PostgreSQL: the db is shared, the psycopg2 library is installed on each instance running Django. Note that you can pickle pandas dataframes (that's how they get saved to file) so that's also how I would transport them if you'd need to do that. I don't know how long it takes to pickle a massive dataframe but I'm sure it's much faster than manual serialization. – dirkgroten Sep 04 '19 at 14:34
  • You can post your proposal as an answer and I’ll accept – Saturnix Sep 04 '19 at 17:29
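
A minimal, self-contained sketch of the pickle round trip mentioned in the comments above (the dataframe here is just a stand-in for one retrieved time series):

import pickle

import pandas as pd

# Stand-in dataframe for one time series.
df = pd.DataFrame(
    {"number": [1.0, 2.0, 3.0]},
    index=pd.date_range("2019-01-01", periods=3),
)

payload = pickle.dumps(df, protocol=pickle.HIGHEST_PROTOCOL)  # bytes to send over the wire
restored = pickle.loads(payload)                              # on the receiving Django instance

assert restored.equals(df)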

1 Answer


The article you refer to in your post is probably the best answer to your question. Clearly good research was done and a few good solutions are proposed (don't forget to take a look at InfluxDB).

Regarding the decoupling of the storage solution from your instances, I don't see the problem:

  • Arctic uses MongoDB as a backing store
  • PyStore uses a file system as a backing store
  • InfluxDB is a database server in its own right

So as long as you decouple the backing store from your instances and make it shared among them, you'll have the same setup as with your PostgreSQL database: MongoDB or InfluxDB can run on a separate, centralised instance, and the file storage for PyStore can be shared, e.g. via a shared mounted volume. The Python libraries that access these stores of course run on your Django instances, just like psycopg2 does.
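
A rough sketch of what that typically looks like in code, with Arctic pointed at a remote MongoDB host and PyStore pointed at a shared mount (the hostname, path, and library/collection names below are placeholders):

import pandas as pd

# Example dataframe standing in for one time series.
df = pd.DataFrame(
    {"number": [1.0, 2.0, 3.0]},
    index=pd.date_range("2019-01-01", periods=3),
)

# --- Arctic: MongoDB runs on a separate, shared instance ---
from arctic import Arctic

store = Arctic("mongodb.internal.example:27017")  # placeholder remote MongoDB host
if "user_42.timeseries" not in store.list_libraries():
    store.initialize_library("user_42.timeseries")
lib = store["user_42.timeseries"]

lib.write("series_001", df)              # write from any Django instance
restored = lib.read("series_001").data   # read back from any other instance

# --- PyStore: the storage path is a shared mounted volume (e.g. NFS/EFS) ---
import pystore

pystore.set_path("/mnt/shared/pystore")  # placeholder shared mount
collection = pystore.store("timeseries").collection("user_42")
collection.write("series_001", df, overwrite=True)
restored_too = collection.item("series_001").to_pandas()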

dirkgroten