
I am working on software that processes time series. Sometimes these are very long (>10 million data points). Our software performs well for shorter time series but becomes unusably bogged down on these long ones. Looking at RAM usage, it's almost 10x what all the time series data together occupy.

Some tests make it clear that a lot of the memory is used by matplotlib, which we use to plot the time series. With a separate piece of code that does ONLY loading of the time series from a file and plotting, I can see that going from loading only (plotting command commented out) to loading plus plotting almost triples memory usage. This is true whether or not the whole time range is visible within the given axis limits, although passing only a small slice of the series (a numpy array) to matplotlib DOES proportionally reduce the excess memory.
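For anyone wanting to reproduce this kind of measurement, here is a minimal sketch of the load-vs-plot test described above. The series length and contents are stand-ins (the real data comes from a file), and the `resource` module is Unix-only:

```python
import resource

import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

def rss_kb():
    # Peak resident set size so far (kilobytes on Linux); Unix-only.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# Stand-in for loading a long series from a file (length is hypothetical).
n = 2_000_000
y = np.random.default_rng(0).standard_normal(n).astype(np.float32)
x = np.linspace(0.0, n / 1000.0, n, dtype=np.float32)
after_load = rss_kb()

fig, ax = plt.subplots()
ax.plot(x, y, lw=0.5)
fig.canvas.draw()  # force the full render pipeline, not just figure setup
after_plot = rss_kb()

print(f"after load: {after_load} kB, after plot: {after_plot} kB")
```

Part of the jump comes from matplotlib keeping its own float64 copies of the vertex data internally, so float32 inputs roughly double in size before any rendering state is counted; exact numbers will vary by version and backend.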

Given that we expect users to scroll through the time series and only view short chunks at a time, it would be much better to have matplotlib fetch only the visible portion of the numpy array, grabbing new elements as the user scrolls or zooms. In fact, it would likely be preferable to replace the X and Y arrays with generators that re-compute the values on the fly as the plot needs them, possibly caching points just outside the limits to make scrolling faster. The X values in particular are simple linspaces that would be best not stored at all, given that computing them should be as fast as a lookup into a huge array, let alone storing them once in the outer software AND again in matplotlib.
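As a sketch of why the X array need not exist at all (the function name, `t0`, `dt`, and the uniform-sampling assumption are mine, not part of any existing API): with uniform sampling, the i-th x value is `t0 + i*dt`, so axis limits map to array indices by arithmetic, and X can be regenerated just for the visible window:

```python
import numpy as np

def visible_window(y, xmin, xmax, t0=0.0, dt=1.0):
    """Return (x, y) for only the samples inside [xmin, xmax].

    Assumes uniform sampling: the i-th x value is t0 + i*dt, so the
    full X array is never stored -- indices follow from arithmetic and
    the window's X values are recomputed on the fly.
    """
    i0 = max(0, int(np.floor((xmin - t0) / dt)))
    i1 = min(len(y), int(np.ceil((xmax - t0) / dt)) + 1)
    x = t0 + dt * np.arange(i0, i1)
    return x, y[i0:i1]

# A 10-million-point series, but only ~1000 X values ever materialise
# for the current view.
y = np.arange(10_000_000, dtype=np.float32)
x_vis, y_vis = visible_window(y, xmin=5000.0, xmax=6000.0)
print(len(x_vis))  # 1001
```

The slice `y[i0:i1]` is a view, not a copy, so on the application side this costs essentially nothing per navigation step.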

I know we could try to "fake" this by capturing user events sent to the plot and re-sending new X and Y arrays every time, but this feels clunky, prone to all sorts of corner cases where things get out of sync, and like taking over work the plotting library "wants" to do itself. At some point it would become easier to just write our own simple plotting routine in C/C++ that does the computations and draws lines using a graphics API. In fact, our nearest closed-source competitor seems to be doing exactly that, given that it's super snappy and uses an amount of RAM that is a mere fraction of the size of a time series. But we want our software to be extensible by users without a deep understanding of its internals.

Is there a standard way of handling this, or is it just too far from the "spirit" of matplotlib to be worth using it? And in that case, is there an alternative Python plotting library with exactly this use case in mind? I would imagine that data scientists working with terabytes of data would want a way to explore it graphically without the plotting code itself eating terabytes of RAM...

biohacker
  • I wouldn't be too afraid to catch the events (as here: https://stackoverflow.com/questions/31490436/matplotlib-finding-out-xlim-and-ylim-after-zoom) and supply the data that is needed - at a reasonable resolution (no more than the pixels on the screen). There is still a lot of stuff done by matplotlib that you would otherwise need to program from scratch. – Dr. V Nov 30 '22 at 22:32
  • I'm not sure I understand what you think is 'fake' about you sending the relevant slice of data to the plotting library in response to user events, vs. the plotting library retrieving the slice of data in response to user events. The former makes more sense, since your application has a better understanding of the meaning of the user events than the plotting library could ever have in a generic sense, and it's not like there's additional overhead. `matplotlib` includes an [interactive mode](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.ion.html) that makes this relatively easy. – Grismar Nov 30 '22 at 23:00
  • @Grismar: I assumed that passing a whole new pair of X and Y arrays to matplotlib every time the user makes a small navigation gesture would involve considerable overhead due to the updating of matplotlib's internal copy of the data in addition to whatever refreshing matplotlib does on its own when the viewport changes. I guess maybe this isn't the case. Note that I'm aware that re-slicing the arrays themselves before passing them to set_data() has essentially no overhead (it's just a bounds change), but matplotlib must keep a lot of state internally given how much RAM it eats. – biohacker Nov 30 '22 at 23:20
  • It may have considerable overhead, but here there is a tradeoff: matplotlib would know how much of an overhead could be avoided by only updating specific parts of the data, but it can't know what part of the data might change (e.g. only y values, with x fixed); you would know what data might change, but don't know exactly how to avoid needless updates on matplotlib. The latter is easier to amend, and you should look for the functions that matplotlib offers that would have the least impact, while still allowing you to update all of the changing data. – Grismar Dec 01 '22 at 04:40
  • @Grismar: I guess the underlying issue here is that there are optimizations that can be performed for time series that aren't for a general scatter plot (where the array of X values may be unsorted, never mind the spacing being non-uniform). Given a pair of X limits (xmin, xmax), determining the start/end indices in the X array (and hence the Y array) is easy. So it makes sense to just have ONE copy of both arrays and for the draw loop to iterate through a slice of it, which changes as the user scrolls/ drags the plot. This requires no storage other than the arrays themselves. – biohacker Dec 01 '22 at 22:17
  • Again, all of that is true, but since matplotlib has no way of knowing how the source data is structured, it makes more sense for some code around the source data to select and preprocess the data efficiently so that matplotlib can render it with minimal overhead, instead of having matplotlib trying to figure out the format itself. You're effectively looking at writing an adapter for your data source that selects, filters and organises data and passes matplotlib exactly what it needs, in response to user actions. – Grismar Dec 01 '22 at 22:22
  • I'm still watching this post out of genuine interest. Agreeing with Grismar, let me add that if you only send, say, a couple of hundred points at a time to matplotlib, there is no noticeable time spent on plotting - even 10000 points take less than a second for me (say you visualise 10 tag trends with 1000 points each). So my advice is to react to the events and replot the data that is needed for the given xlimits. – Dr. V Dec 02 '22 at 21:54
  • @Dr. V--I want the plot to refresh as quickly as the user scrolls--so a second per frame is way too long. Though as it is, for small data sets we have to insert an explicit pause in the scrolling code so it doesn't jump all the way to the beginning/end as soon as the user starts holding down the "scroll left" or "scroll right" button. So just sending new axis limits (without sending new arrays) is very fast as long as the arrays are small. I will try also sending new smaller arrays (or rather new *slices*, which don't actually require re-allocating anything on the app side) and see how it does. – biohacker Dec 05 '22 at 20:54
  • @biohacker, I know what you mean. Some software I used solved that by initially sending very little data on the scroll event (say 100 points). You then see that the trends are not detailed while scrolling. Then, when you stop scrolling, they catch that and send more data. A bit like a search field that only starts searching after you stop typing for say 1/2 second. – Dr. V Dec 05 '22 at 21:18
  • To those who are following--I tried that (re-sending a new slice on every scroll) and it works beautifully! It's just as responsive now on large data sets as on small ones. It's actually easier than trapping events, as all the scrolling and zooming commands from the user come through our interface rather than matplotlib anyway. – biohacker Dec 08 '22 at 23:23
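For later readers, the approach the comments converge on - re-sending only the visible window on navigation, decimated to roughly screen resolution via per-bin min/max as Dr. V suggests - can be sketched like this. The helper names, `dt`, and the bin count are illustrative, not anyone's actual code:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # any interactive backend works the same way
import matplotlib.pyplot as plt

def decimate_minmax(y, i0, i1, max_bins=1000):
    """Shrink y[i0:i1] to at most ~2*max_bins points by keeping each
    bin's min and max, preserving the visual envelope of the trace."""
    n = i1 - i0
    if n <= 2 * max_bins:
        return np.arange(i0, i1), y[i0:i1]
    step = n // max_bins
    m = (n // step) * step                    # trim to a multiple of step
    seg = y[i0:i0 + m].reshape(-1, step)
    base = i0 + np.arange(seg.shape[0]) * step
    idx = np.sort(np.concatenate([base + seg.argmin(axis=1),
                                  base + seg.argmax(axis=1)]))
    return idx, y[idx]

def on_xlim_changed(ax, line, y, dt=1.0):
    """Re-send only the visible, decimated slice whenever the view moves."""
    xmin, xmax = ax.get_xlim()
    i0 = max(0, int(xmin / dt))
    i1 = min(len(y), int(xmax / dt) + 1)
    idx, yy = decimate_minmax(y, i0, i1)
    line.set_data(idx * dt, yy)               # cheap: a few thousand points

y = np.random.default_rng(0).standard_normal(10_000_000).astype(np.float32)
fig, ax = plt.subplots()
(line,) = ax.plot([], [], lw=0.5)
ax.callbacks.connect("xlim_changed",
                     lambda a: on_xlim_changed(a, line, y))
ax.set_xlim(0, 50_000)                        # triggers the first redraw
```

Min/max decimation is chosen over plain subsampling so that narrow spikes stay visible at any zoom level; Dr. V's later suggestion - coarse data while scrolling, full detail once the user stops - layers naturally on top of this by varying `max_bins`.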

0 Answers