
I've written a Python program that takes a number of large (~10 GB) CSV files, reads them in chunks, and after some calculations calls matplotlib.pyplot.hist (plt.hist) on the values. The resulting histogram counts are saved in a list. The 40 files are analyzed in parallel using concurrent.futures, with max_workers set to 10.

While testing on some smaller data sets, I received a runtime error at this line:


  File "D:\redacted.py", line 66, in Test
    HistData1, bla1, bla2 =  plt.hist(HistogramData, bins=binedges)

Here, HistogramData is a list of integers between -90 and +90. The rest of the traceback after this point:


  File "C:\redacted\anaconda3\lib\site-packages\matplotlib\pyplot.py", line 2685, in hist
    return gca().hist(

  File "C:\redacted\anaconda3\lib\site-packages\matplotlib\pyplot.py", line 2368, in gca
    return gcf().gca(**kwargs)

  File "C:\redacted\anaconda3\lib\site-packages\matplotlib\figure.py", line 2065, in gca
    return self.add_subplot(1, 1, 1, **kwargs)

  File "C:\redacted\anaconda3\lib\site-packages\matplotlib\figure.py", line 1404, in add_subplot
    return self._add_axes_internal(key, ax)

  File "C:\redacted\anaconda3\lib\site-packages\matplotlib\figure.py", line 1408, in _add_axes_internal
    self._axstack.add(key, ax)

  File "C:\redacted\anaconda3\lib\site-packages\matplotlib\figure.py", line 125, in add
    super().remove((key, a_existing))
ValueError: Given element not contained in the stack

This error occurs on only one of the chunks, and it is not the first one.

I can't find any information on this error in the matplotlib documentation, on Stack Exchange, or through Google in general. Does anyone have useful insight into what this error is signaling? For example, could it be caused by multiple histograms being drawn in parallel on the same Axes?

MWE (approximately: the individual parts work, but combined and with specific data they give the above error):
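To make the suspicion about a shared Axes concrete, here is a minimal single-threaded sketch. It is not taken from the program above: the data and bin edges are made up, and the Agg backend is chosen only so it runs without a display. It shows that successive plt.hist calls draw onto the same implicit "current" figure and Axes, which is the global state that concurrent threads would also be sharing.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed
import matplotlib.pyplot as plt

plt.close("all")  # start from a clean pyplot state

# First call: pyplot implicitly creates a figure and an Axes, then draws on it.
counts_a, _, _ = plt.hist([1, 2, 2, 3], bins=[0, 2, 4])
fig_after_first = plt.gcf()

# Second call: no new figure is created; the same current Axes is reused.
counts_b, _, _ = plt.hist([5, 6], bins=[4, 6, 8])
fig_after_second = plt.gcf()

same_figure = fig_after_first is fig_after_second
n_axes = len(fig_after_second.axes)

print(same_figure, n_axes)
```

Since every plt.* call goes through this one current figure, calls from several threads can interleave while the figure's internal Axes bookkeeping is being updated.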

import os
from concurrent import futures

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def FindPairs(data):
    # (Do stuff to data)
    # (Do more stuff to data to turn it into a list of integers called HistogramData)

    # plt.hist returns (counts, bin edges, patches); only the counts are kept
    HistData1, bla1, bla2 = plt.hist(HistogramData, bins=binedges)
    return HistData1


def Function(File_name):
    # (Set variables like chunksize here)

    Data = []
    for chunk in pd.read_csv(File_name, chunksize=chunksize):
        t = FindPairs(chunk)
        Data.append(t)
    # Sum the per-chunk histograms into one histogram for the file
    Data = sum(Data)
    return Data


ls = os.listdir()
ex = futures.ThreadPoolExecutor(max_workers=10)
# One summed histogram per file, in the same order as ls
Data = list(ex.map(Function, ls))
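
If only the bin counts are needed and nothing is actually drawn, one possible workaround is to skip pyplot and call np.histogram directly; plt.hist computes its counts with np.histogram internally. The sketch below is an assumption-laden stand-in for the program above: count_chunk, count_file, BIN_EDGES and the random "files" are hypothetical names and data, not the real pipeline.

```python
from concurrent import futures

import numpy as np

# Hypothetical bin edges: one bin per integer value from -90 to +90.
BIN_EDGES = np.arange(-90, 92)


def count_chunk(values):
    """Return the histogram counts for one chunk without touching pyplot."""
    counts, _ = np.histogram(values, bins=BIN_EDGES)
    return counts


def count_file(chunks):
    """Sum the per-chunk counts into one histogram per file."""
    return sum(count_chunk(chunk) for chunk in chunks)


# Made-up stand-in for the CSV files: 4 "files" of 3 chunks each.
rng = np.random.default_rng(0)
fake_files = [
    [rng.integers(-90, 91, size=1000) for _ in range(3)]
    for _ in range(4)
]

with futures.ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(count_file, fake_files))
```

Each call works only on its own arrays, so the threads do not share a current figure or Axes.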
Hunted
  • Possibly this is what you're running into: https://matplotlib.org/stable/faq/howto_faq.html#work-with-threads. There's another question about it here: https://stackoverflow.com/questions/34764535/why-cant-matplotlib-plot-in-a-different-thread, which points the OP to multiprocessing. You could try `concurrent.futures.ProcessPoolExecutor` instead. – mechanical_meat Aug 28 '21 at 23:31
  • It seems very likely to me that would be the problem. I had no clue matplotlib wasn't safe to thread, and I'm quite surprised by it. Thank you! – Hunted Aug 29 '21 at 23:40
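
Following up on the comments: besides switching to `concurrent.futures.ProcessPoolExecutor`, another option that stays with threads is to give every call its own Figure object instead of going through pyplot. This is only a sketch under assumptions: hist_counts, BIN_EDGES and the sample data are hypothetical, and matplotlib makes no general thread-safety guarantee, so it avoids the shared current figure rather than promising safety.

```python
from concurrent import futures

from matplotlib.figure import Figure

# Hypothetical bin edges: one bin per integer value from -90 to +90.
BIN_EDGES = list(range(-90, 92))


def hist_counts(values):
    """Histogram `values` on a private Figure, bypassing pyplot's global state."""
    fig = Figure()       # not registered with pyplot, so no shared current figure
    ax = fig.subplots()  # a single Axes owned by this figure only
    counts, _edges, _patches = ax.hist(values, bins=BIN_EDGES)
    return counts


# Made-up sample data standing in for HistogramData from different chunks.
samples = [[-90, 0, 0, 90], [1, 1, 1], [45]]

with futures.ThreadPoolExecutor(max_workers=3) as ex:
    results = list(ex.map(hist_counts, samples))
```

Because each Figure is local to the function call, it is discarded once the counts are returned.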
