I want to plot a fairly small IoT CSV dataset, about 2 GB, with dimensions of roughly (20,000, 18,000). Each column should become a subplot with its own y-axis. I use the following code to generate the picture:

import random

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

name = 'plot.png'  # output file name (not defined in the original snippet)

# synthetic stand-in for the real data: 2000 timestamps x 2000 columns
times = pd.date_range('2012-10-01', periods=2000, freq='2min')
timeseries_array = np.array(times)
cols = random.sample(range(1, 2001), 2000)
values = []
for col in cols:
    values.append(random.sample(range(1, 2001), 2000))

time = pd.DataFrame(data=timeseries_array, columns=['date'])
graph = pd.DataFrame(data=values, columns=cols, index=timeseries_array)

# one subplot per column
fig, axarr = plt.subplots(len(graph.columns), sharex=True, sharey=True,
                          constrained_layout=True, figsize=(50, 50))
fig.autofmt_xdate()

for i, ax in enumerate(axarr):
    ax.plot(time['date'], graph[graph.columns[i]].values)
    ax.set(ylabel=graph.columns[i])
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
    myFmt = mdates.DateFormatter('%d.%m.%Y %H:%M')
    ax.xaxis.set_major_formatter(myFmt)
    ax.label_outer()

print('--save-fig--')
plt.savefig(name, dpi=500)
plt.close()

But this is incredibly slow: 100 subplots took about 1 minute, and 2000 took around 20 minutes. My machine has 10 cores and 35 GB of RAM. Do you have any hints for speeding this up? Is multithreading possible? As far as I can see, this only uses one core. Are there tricks to draw only the relevant things? Or is there an alternative way to draw this plot faster, all in one figure without subplots?

lkaupp
  • I wouldn't recommend using so many subplots; it's likely taking forever to evaluate good dimensions (i.e. no collisions with the next plot) for each one. Either try following [this answer](https://stackoverflow.com/a/13060980/565489) or maybe save each plot individually and then join the figures afterwards? – Asmus Apr 24 '19 at 11:15
  • Can you go more into detail about what the purpose of drawing 18000 subplots would be? Consider that if you have a 50 inch figure with a dpi of 100, each subplot will be 5000/18000 < 1 pixel large == not even visible. – ImportanceOfBeingErnest Apr 24 '19 at 11:55
  • I'm trying to visualize the IoT data so I can detect peaks by hand and verify my recordings. – lkaupp Apr 24 '19 at 12:50
  • Also, never use constrained_layout on so many axes. It's solving a linear constraint problem, and having a lot of axes makes that problem huge. – Jody Klymak Apr 24 '19 at 13:49
  • Yeah, thanks to Asmus I came up with my own solution to this problem. I thought this couldn't be a huge dealbreaker, but after all the code digging it seems a bit unusual to plot such a high number of columns. @Asmus Thanks again for your comment! – lkaupp Apr 24 '19 at 14:10
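
To put the pixel estimate from the comments into numbers, here is a rough sketch. The helper name `pixels_per_subplot` is made up for illustration, and it ignores margins and spacing between subplots, so the real value is even smaller:

```python
def pixels_per_subplot(fig_height_inches, dpi, n_subplots):
    # Total figure height in pixels, divided evenly among vertically stacked subplots.
    # Assumption: no margins or padding, so this is an upper bound.
    return fig_height_inches * dpi / n_subplots


# The example from the comment: 50 inch figure, 100 dpi, 18000 subplots
print(pixels_per_subplot(50, 100, 18000))

# The settings from the question: 50 inch figure, 500 dpi, 2000 subplots
print(pixels_per_subplot(50, 500, 2000))
```

With less than one pixel per subplot, the individual series cannot be distinguished, no matter how fast the figure renders.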

1 Answer

Thanks to @Asmus, I came up with this solution, which brought the runtime down from 20 minutes to 40 seconds for a (2000, 2000) dataframe. I did not find any well-documented solution for beginners like me, so I am posting mine here. It is meant for time series with a huge number of columns:

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np


def print_image_fast(name="default.png", graph=None):
    # graph: DataFrame with a datetime index, one series per column (values assumed >= 0)
    int_columns = len(graph.columns)
    # enlarge the figure by 30 inches for every 1000 columns; this works well with
    # 500 dpi, label size 2 and line width 0.1
    y_size = (int_columns / 1000) * 30
    fig = plt.figure(figsize=(10, y_size))
    ax = fig.add_subplot(1, 1, 1)
    # date formatter for the time series on the x-axis
    myFmt = mdates.DateFormatter('%d.%m.%Y %H:%M')
    ax.xaxis.set_major_formatter(myFmt)
    # y positions of the column labels
    y_label_offsets = []
    # accumulated offset along the y-axis; column 0 starts at 0
    y_offset = 0
    current = 0
    for i, col in enumerate(graph.columns):
        # max value (height) of the previous column
        last = current
        # max value of the current column, i.e. its height on the y-axis
        current = np.amax(graph[col].values)

        if i > 0:
            # new zero line: accumulated offset + height of the previous column + 1
            y_offset = y_offset + last + 1

        # the label sits at the current offset plus half the height of the column
        y_label_offsets.append(y_offset + (current / 2))
        # plot the column shifted by its offset
        ax.plot(graph.index.values, graph[col].values + y_offset,
                'r-o', ms=0.1, mew=0, mfc='r', linewidth=0.1)

    # y limits: last offset plus the full height of the last column
    ax.set_ylim([0, y_offset + current])
    # x limits: first and last timestamp
    ax.set_xlim([graph.index.values[0], graph.index.values[-1]])

    # put the column names at the computed positions on the y-axis
    plt.yticks(y_label_offsets, graph.columns, fontsize=2)
    # time labels on the x-axis
    plt.xticks(fontsize=15, rotation=90)

    plt.savefig(name, dpi=500)
    plt.close()
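
The offset bookkeeping in the loop above is the core of the trick, so here is a minimal standalone sketch of just that part. The helper name `stack_offsets` is hypothetical, and it assumes a non-empty list of non-negative column maxima:

```python
from itertools import accumulate


def stack_offsets(column_maxima):
    # Each column starts one unit above the top of the previous column,
    # so the offset of column i is the running sum of (max + 1) over columns 0..i-1.
    offsets = [0] + list(accumulate(m + 1 for m in column_maxima[:-1]))
    # Labels sit at half the height of each column, measured from its own zero line.
    labels = [off + m / 2 for off, m in zip(offsets, column_maxima)]
    # Upper y limit: zero line of the last column plus its full height.
    y_limit = offsets[-1] + column_maxima[-1]
    return offsets, labels, y_limit


offsets, labels, y_limit = stack_offsets([3, 5, 2])
print(offsets)
print(labels)
print(y_limit)
```

Because all series end up in a single Axes, matplotlib only has to lay out one set of ticks and spines instead of thousands, which is presumably where most of the time is saved.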

Edit: For anybody interested, a dataframe with shape (20k, 20k) takes up around 20 GB of RAM. I also had to switch savefig to SVG output, because Agg can't handle image dimensions greater than 2^16 pixels.
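
A rough way to check in advance whether a figure will hit that limit is to multiply the figure size by the dpi. This is only a sketch: `pick_output_format` and `AGG_MAX_PIXELS` are names invented here, and the defaults (10 inch width, 30 inches per 1000 columns, 500 dpi) are taken from the function above:

```python
AGG_MAX_PIXELS = 2 ** 16  # per-dimension limit of the Agg renderer mentioned above


def pick_output_format(n_columns, width_inches=10, inches_per_1000=30, dpi=500):
    # Same sizing rule as in print_image_fast: 30 inches of height per 1000 columns.
    height_inches = (n_columns / 1000) * inches_per_1000
    height_px = height_inches * dpi
    width_px = width_inches * dpi
    # Fall back to a vector format when either dimension is too large for a raster image.
    if max(width_px, height_px) >= AGG_MAX_PIXELS:
        return "svg"
    return "png"


print(pick_output_format(2000))   # 60 in * 500 dpi = 30,000 px high
print(pick_output_format(20000))  # 600 in * 500 dpi = 300,000 px high
```

The returned string could then be used as the file extension passed to savefig. Lowering the dpi is another option if a raster image is required.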

lkaupp