I've written a Python program that takes "x" large (~10 GB) CSV files, chunks them, and, after some calculations, calls matplotlib.pyplot.hist (plt.hist) on the values. The resulting histogram counts are saved in a list. The 40 different files are analyzed in parallel using concurrent.futures, with max_workers=10.
While testing on some smaller data sets, I received a runtime error at:
  File "D:\redacted.py", line 66, in Test
    HistData1, bla1, bla2 = plt.hist(HistogramData, bins=binedges)
Here, HistogramData is a list of integers between -90 and +90. The rest of the traceback after this point is:
  File "C:\redacted\anaconda3\lib\site-packages\matplotlib\pyplot.py", line 2685, in hist
    return gca().hist(
  File "C:\redacted\anaconda3\lib\site-packages\matplotlib\pyplot.py", line 2368, in gca
    return gcf().gca(**kwargs)
  File "C:\redacted\anaconda3\lib\site-packages\matplotlib\figure.py", line 2065, in gca
    return self.add_subplot(1, 1, 1, **kwargs)
  File "C:\redacted\anaconda3\lib\site-packages\matplotlib\figure.py", line 1404, in add_subplot
    return self._add_axes_internal(key, ax)
  File "C:\redacted\anaconda3\lib\site-packages\matplotlib\figure.py", line 1408, in _add_axes_internal
    self._axstack.add(key, ax)
  File "C:\redacted\anaconda3\lib\site-packages\matplotlib\figure.py", line 125, in add
    super().remove((key, a_existing))
ValueError: Given element not contained in the stack
This error occurs on only one of the chunks, and not the first one.
I can't find any information on this error in the matplotlib documentation, on Stack Exchange, or with Google in general. Does anyone have any useful insight into what this error is signaling? For example, could it be caused by multiple histograms being drawn in parallel on the same pyplot axes?
MWE (ish: the individual parts work, but when combined and run on specific data they give the above error):
from concurrent import futures
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def FindPairs(data):
    # (Do stuff to data)
    # (Do more stuff to data to turn it into a list of integers called HistogramData)
    HistData1, bla1, bla2 = plt.hist(HistogramData, bins=binedges)
    return HistData1

def Function(File_name):
    # (Set variables like chunksize here)
    Data = []
    for chunk in pd.read_csv(File_name, chunksize=chunksize):
        t = FindPairs(chunk)
        Data.append(t)
    Data = sum(Data)  # Sum the per-chunk histograms into one histogram for each file
    return Data

ls = [file for file in os.listdir()]
Data = [None] * len(ls)
ex = futures.ThreadPoolExecutor(max_workers=10)
Data = ex.map(Function, ls)
Data = list(Data)
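
Since I only need the counts and never look at the plot itself, here is a rough sketch of the same flow with np.histogram in place of plt.hist. As far as I understand, plt.hist gets its counts from np.histogram, so the numbers should match, and nothing here touches pyplot's shared current figure. The names BIN_EDGES, histogram_counts, process_chunks and the fake data are made up for illustration and are not my real code; the bin edges in particular are an assumption (one bin per integer from -90 to +90).

```python
from concurrent import futures

import numpy as np

# Assumed bin edges (hypothetical): -90, -89, ..., 91, giving 181 bins,
# one per integer value from -90 to +90.
BIN_EDGES = np.arange(-90, 92)


def histogram_counts(values):
    """Return per-bin counts without creating any figure or axes."""
    counts, _ = np.histogram(values, bins=BIN_EDGES)
    return counts


def process_chunks(chunks):
    """Sum the per-chunk histograms into one histogram (stand-in for one file)."""
    total = np.zeros(len(BIN_EDGES) - 1, dtype=np.int64)
    for chunk in chunks:
        total += histogram_counts(chunk)
    return total


# Made-up stand-in data: two "files", each given as a list of chunks.
fake_files = [
    [[-90, 0, 0, 90], [1, 1, 1]],
    [[-5, 5], [0]],
]

with futures.ThreadPoolExecutor(max_workers=10) as ex:
    results = list(ex.map(process_chunks, fake_files))
```

Each entry of results is one summed histogram per stand-in file, in the same order as fake_files.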