As part of a large QC benchmark, I am creating a large number (approx. 100K) of scatter plots in a single PDF using the PdfPages backend. (See further down for the code.)

The issue I am having is that the plotting takes too much time; see the output below from a custom profiling/debugging effort:

Checkpoint1: Predictions done in 1.110076904296875 millis
Checkpoint2: df created and correlations calculated in 3.108978271484375 millis
Checkpoint3: plotting and accumulating done in 231.31990432739258 millis
Cycle completed in 0.23553895950317383 secs
----------------------
Checkpoint1: Predictions done in 3.718852996826172 millis
Checkpoint2: df created and correlations calculated in 2.353191375732422 millis
Checkpoint3: plotting and accumulating done in 155.93385696411133 millis
Cycle completed in 0.16200590133666992 secs
----------------------
Checkpoint1: Predictions done in 2.920866012573242 millis
Checkpoint2: df created and correlations calculated in 1.995086669921875 millis
Checkpoint3: plotting and accumulating done in 161.8819236755371 millis
Cycle completed in 0.16679787635803223 secs
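
For context, the timings above come from a simple checkpoint scheme; a minimal sketch of how such output can be produced (this helper code is a hypothetical reconstruction, not the actual benchmark code):

import time

# Hypothetical sketch of the checkpoint timing: record a timestamp
# after each stage and report the delta in milliseconds.
cycle_start = time.time()

t0 = time.time()
# ... run predictions ...
print(f"Checkpoint1: Predictions done in {(time.time() - t0) * 1000} millis")

t0 = time.time()
# ... build df and compute correlations ...
print(f"Checkpoint2: df created and correlations calculated in {(time.time() - t0) * 1000} millis")

t0 = time.time()
# ... plot and accumulate into the PDF ...
print(f"Checkpoint3: plotting and accumulating done in {(time.time() - t0) * 1000} millis")

print(f"Cycle completed in {time.time() - cycle_start} secs")
print("----------------------")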

The plotting time increases 2-3x if I annotate the points, which is necessary for the use case. As you can see below, I have tried both itertuples() and apply(); switching to apply() did not give a significant change in the times as far as I can tell.

import matplotlib.pyplot as plt


def annotate(row, ax):
    """Annotate a single point with its index label."""
    ax.annotate(row.name, (row.exp, row.model),
                xytext=(10, 20), textcoords='offset points',
                arrowprops=dict(arrowstyle="-", connectionstyle="arc,angleA=180,armA=10"),
                family='sans-serif', fontsize=8, color='darkslategrey')


def plot2File(df, file, seq, z, p, s):
    """Plot predictions vs experimental, appending the figure to `file` (a PdfPages object)."""
    plttitle = f"Correlations for {seq}+{z} \n pearson={p} \n spearman={s}"
    ax = df.plot(x='exp', y='model', kind='scatter', title=plttitle, s=40)
    df.apply(annotate, ax=ax, axis=1)
    # Alternative I also tried: itertuples() instead of apply()
    # for row in df.itertuples():
    #     ax.annotate(row.Index, (row.exp, row.model),
    #                 xytext=(10, 20), textcoords='offset points',
    #                 arrowprops=dict(arrowstyle="-", connectionstyle="arc,angleA=180,armA=10"),
    #                 family='sans-serif', fontsize=8, color='darkslategrey')

    plt.savefig(file, bbox_inches='tight', format='pdf')
    plt.close()
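
One variant I have been considering is to create the figure and axes once and reuse them across cycles, rather than building a new figure per plot. A minimal, untested sketch (`results` is a hypothetical iterable of (df, seq, z, p, s) tuples; the output filename is made up):

import matplotlib
matplotlib.use('Agg')  # non-interactive backend; no GUI needed for batch plotting
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

with PdfPages('benchmark.pdf') as pdf:
    # Reuse one figure/axes pair instead of creating a new one per plot.
    fig, ax = plt.subplots()
    for df, seq, z, p, s in results:
        ax.cla()  # clear the axes rather than building a new figure
        ax.scatter(df['exp'], df['model'], s=40)
        ax.set_title(f"Correlations for {seq}+{z} \n pearson={p} \n spearman={s}")
        for row in df.itertuples():
            ax.annotate(row.Index, (row.exp, row.model),
                        xytext=(10, 20), textcoords='offset points',
                        arrowprops=dict(arrowstyle="-", connectionstyle="arc,angleA=180,armA=10"),
                        family='sans-serif', fontsize=8, color='darkslategrey')
        pdf.savefig(fig, bbox_inches='tight')  # append the current state as a new page
    plt.close(fig)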

Given the nice explanation by Jeff on a question regarding iterrows(), I was wondering whether it would be possible to vectorize the annotation process. Or should I ditch the data frame altogether?

  • The limiting factor here is the drawing of the annotations. Since all annotations are different, I don't see what vectorizing would do here. When using a dataframe, the same annotations need to be drawn at some point, hence it will take the same time. You might parallelize the plotting using multiprocessing, which, with 4 cores instead of one, could speed things up by at most a factor of 4. – ImportanceOfBeingErnest Sep 20 '17 at 15:42
  • @ImportanceOfBeingErnest I suspected that to be the case as well, but thought it wouldn't hurt to ask. I am not sure how I can parallelize the plotting, since the plots are to be written to the same file. Wouldn't the GIL mess things up? – posdef Sep 21 '17 at 08:41
  • Possibly; you may try it out. An option may be to create all the files separately and merge them into one PDF at the end. Maybe [this is a place to start](https://stackoverflow.com/questions/41037840/matplotlib-savefig-performance-saving-multiple-pngs-within-loop). – ImportanceOfBeingErnest Sep 21 '17 at 09:24
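
Following up on the multiprocessing suggestion from the comments, a minimal sketch of the one-file-per-plot approach; `plot_one` and `jobs` are hypothetical names, and the final merge step is left out (it could be done with an external tool such as pdfunite or a library such as PyPDF2):

import multiprocessing as mp

import matplotlib
matplotlib.use('Agg')  # must be set before importing pyplot in worker processes
import matplotlib.pyplot as plt

def plot_one(args):
    """Hypothetical worker: render one scatter plot to its own single-page PDF."""
    idx, df, seq, z, p, s = args
    plttitle = f"Correlations for {seq}+{z} \n pearson={p} \n spearman={s}"
    ax = df.plot(x='exp', y='model', kind='scatter', title=plttitle, s=40)
    df.apply(annotate, ax=ax, axis=1)  # reuses the annotate() helper from above
    plt.savefig(f"plot_{idx:06d}.pdf", bbox_inches='tight')
    plt.close()

if __name__ == '__main__':
    # `jobs` is a hypothetical list of (idx, df, seq, z, p, s) tuples.
    # Separate processes sidestep the GIL; each worker writes its own file,
    # so there is no contention on a shared PDF.
    with mp.Pool(processes=4) as pool:
        pool.map(plot_one, jobs)
    # ... then merge the per-plot PDFs into a single document.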
