
I'm trying to plot 20 million data points with matplotlib, but it is taking an extremely long time (over an hour).

Is there something in my code that is making this unusually slow?

import csv
import matplotlib.pyplot as plt
import numpy as np
import Tkinter
from Tkinter import *
import tkSimpleDialog
from tkFileDialog import askopenfilename

plt.clf()

root = Tk()
root.withdraw() 
listofparts = askopenfilename()                  # asks user to select file
root.destroy()

my_list1 = []
my_list2 = []
k = 0

csv_file = open(listofparts, 'rb')

for line in open(listofparts, 'rb'):
    current_part1 = line.split(',')[0]
    current_part2 = line.split(',')[1]
    k = k + 1
    if k >= 2:                                   # skips the first line
        my_list1.append(current_part1)
        my_list2.append(current_part2)

csv_file.close()

plt.plot(my_list1 * 10, 'r')
plt.plot(my_list2 * 10, 'g')

plt.show()
plt.close()
Sayse
darrenvba
  • I've removed the parts of your question relating to a library recommendation since those questions are off-topic on Stack Overflow. – Sayse Feb 08 '17 at 09:07
  • Have you profiled your code to find the bottlenecks? On my PC, plotting 1 million random data points takes a few seconds, while 2 million or more points lead to the error "In draw_path: Exceeded cell block limit". I can also suggest the PyQtGraph library. – Roman Fursenko Feb 08 '17 at 09:25
  • Thanks. I was getting an "overflow error: Allocated too many blocks" error when running over 1 million data points, but I fixed this by adding matplotlib.pyplot.rcParams['agg.path.chunksize'] = 20000. However, even running 100,000 data points takes at least 20 minutes. My laptop only has 4 GB of RAM; could the problem be entirely with my laptop? – darrenvba Feb 08 '17 at 09:43
  • Not required for the code; I probably should have taken the * 10 out for Stack Overflow. Excel has a limit of about 1 million rows, so I created the CSV in Excel with 1 million data points, but I wanted to test more. It should just repeat the graph line 10 times; it worked when trying smaller data sets. – darrenvba Feb 08 '17 at 10:00

3 Answers

7

There is no reason whatsoever to have a line plot of 20000000 points in matplotlib.

Let's consider printing first: the maximum figure size in matplotlib is 50 inches. Even a high-end plotter with 3600 dpi would give at most 50*3600 = 180000 resolvable points.

For screen applications it's even less: even a high-end 4K screen is limited to about 4000 pixels across. Even allowing for anti-aliasing effects, at most ~3 points per pixel would still be distinguishable to the human eye. Result: a maximum of about 12000 points makes sense.

Therefore the question you should be asking is rather: how do I subsample my 20000000 data points to a set of points that still produces the same image on paper or screen?

The solution to this strongly depends on the nature of the data. If it is sufficiently smooth, you can just take every nth list entry.

sample = data[::n]
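
As a rough sketch of how n might be chosen, assuming Python 3 and taking the ~12000-point screen estimate above as the target (the helper name subsample and the max_points parameter are made up for illustration):

```python
def subsample(data, max_points=12000):
    """Return every nth entry so that at most max_points entries remain."""
    # Ceiling division: the smallest step that keeps the result within max_points.
    n = max(1, -(-len(data) // max_points))
    return data[::n]


# Stand-in for the real measurements; a range supports len() and slicing
# just like a list, without allocating 20 million values.
data = range(20000000)
sample = subsample(data)
print(len(sample))
```

The same slicing works on a plain list or a NumPy array, so the result can be passed straight to plt.plot.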

If there are high-frequency components which need to be resolved, this would require more sophisticated techniques, which will again depend on what the data looks like.

One such technique might be the one shown in How can I subsample an array according to its density? (Remove frequent values, keep rare ones).
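
Another simple option, which is not the density-based approach from the linked question but a commonly used alternative, is to keep the minimum and maximum of each bin so that narrow spikes survive the reduction. The following is only a sketch under that assumption; the function name minmax_decimate and the n_bins parameter are hypothetical:

```python
def minmax_decimate(data, n_bins):
    """Keep the minimum and maximum of each bin, in their original order."""
    size = max(1, -(-len(data) // n_bins))  # ceiling division
    reduced = []
    for start in range(0, len(data), size):
        chunk = data[start:start + size]
        lo = min(range(len(chunk)), key=chunk.__getitem__)
        hi = max(range(len(chunk)), key=chunk.__getitem__)
        # Emit the extremes in the order they occur; a flat bin yields one point.
        for i in sorted({lo, hi}):
            reduced.append(chunk[i])
    return reduced


# A flat signal with one narrow spike that plain striding can miss.
signal = [0.0] * 1000
signal[537] = 5.0

strided = signal[::100]                        # every 100th entry
reduced = minmax_decimate(signal, n_bins=10)   # at most 2 points per bin

print(max(strided), max(reduced))
```

Here the strided sample never lands on the spike, while the min/max version keeps it at the cost of up to two points per bin.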

ImportanceOfBeingErnest
  • "There is no reason whatsoever to have a line plot of 20000000 points in matplotlib." That's quite the assumption. I have about 4,000,000 data points in time-series sensor outputs and I'm making a tool to help identify interesting sections of the data. I'll need to zoom in and see the datapoints at max resolution in those areas. I don't get why matplotlib doesn't have built-in subsampling. I don't care if the overall plot is subsampled, but when I zoom I need the highest resolution. – Joseph Meadows Jun 21 '22 at 14:54
2

The following approach might give you a small improvement. It avoids splitting each row twice (by using Python's csv library) and also removes the if statement by skipping over the two header lines before the loop starts:

import matplotlib.pyplot as plt
import csv

l1, l2 = [], []

with open('input.csv', 'rb') as f_input:
    csv_input = csv.reader(f_input)

    # Skip two header lines
    next(csv_input)
    next(csv_input)

    for cols in csv_input:
        # Convert to float so matplotlib plots numbers rather than strings
        l1.append(float(cols[0]))
        l2.append(float(cols[1]))

plt.plot(l1, 'r')
plt.plot(l2, 'g')
plt.show()

I would say the main slowdown, though, will still be the plot itself.
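
If the plot does turn out to be the bottleneck, one possible variation is to thin the rows while reading, so that far fewer points ever reach plt.plot. This is a sketch only, assuming Python 3; the step value is a hypothetical thinning factor and the in-memory file stands in for the real CSV (with a real file, open('input.csv', newline='') would replace it):

```python
import csv
import io
from itertools import islice

# Stand-in for the real file: two header lines, then two numeric columns.
f_input = io.StringIO(
    "header 1\nheader 2\n"
    + "".join("{},{}\n".format(i, 2 * i) for i in range(1000))
)

step = 100  # hypothetical thinning factor: keep every 100th row
l1, l2 = [], []

csv_input = csv.reader(f_input)

# Skip two header lines
next(csv_input)
next(csv_input)

# islice yields rows 0, step, 2*step, ... without storing the skipped rows.
for cols in islice(csv_input, 0, None, step):
    l1.append(float(cols[0]))
    l2.append(float(cols[1]))

print(len(l1), len(l2))
```

The resulting lists can be plotted exactly as before with plt.plot(l1, 'r') and plt.plot(l2, 'g'). Note that plain striding like this can skip narrow peaks in the data.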

Martin Evans
1

I would recommend switching to pyqtgraph. I switched to it because of speed issues while trying to make matplotlib plot real-time data. It worked like a charm. Here's my real-time plotting example.

Joonatan Samuel