I think what is needed here is a consistent way of outputting a table to a pdf file amongst graphs output to pdf.
My first thought was not to use the matplotlib PDF backend, i.e.
from matplotlib.backends.backend_pdf import PdfPages
because it seemed somewhat limited in formatting options and leaned towards rendering the table as an image (which makes the text of the table non-selectable).
If you want to mix dataframe output and matplotlib plots in a pdf without using the matplotlib pdf backend, I can think of two ways.
- Generate your pdf of matplotlib figures as before and then insert pages containing the dataframe table afterwards. I view this as a difficult option.
- Use a different library to generate the pdf. I illustrate one option to do this below.
First, install the xhtml2pdf library. It seems a little patchily supported, but it is active on GitHub and has some basic usage documentation. You can install it via pip, i.e.
pip install xhtml2pdf
Once you've done that, here is a barebones example embedding a matplotlib figure, then the table (all text selectable), then another figure. You can play around with CSS etc to alter the formatting to your exact specifications, but I think this fulfils the brief:
from xhtml2pdf import pisa  # this is the module that will do the work
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


# Utility function
def convertHtmlToPdf(sourceHtml, outputFilename):
    # open output file for writing (truncated binary)
    with open(outputFilename, "w+b") as resultFile:
        # convert HTML to PDF
        pisaStatus = pisa.CreatePDF(
            sourceHtml,       # the HTML to convert
            dest=resultFile,  # file handle to receive result
            path='.')         # this path is needed so relative paths for
                              # temporary image sources work
    # pisaStatus.err is falsy on success and truthy on errors
    return pisaStatus.err


# Main program
if __name__ == '__main__':
    # Define your data
    arrays = [np.hstack([['one']*3, ['two']*3]), ['Dog', 'Bird', 'Cat']*2]
    columns = pd.MultiIndex.from_arrays(arrays, names=['foo', 'bar'])
    df = pd.DataFrame(np.zeros((3, 6)), columns=columns,
                      index=pd.date_range('20000103', periods=3))

    sourceHtml = '<html><head>'
    # add some table CSS in head
    sourceHtml += '''<style>
    table, td, th {
        border-style: double;
        border-width: 3px;
    }
    td, th {
        padding: 5px;
    }
    </style>'''
    sourceHtml += '</head><body>'

    # Add a matplotlib figure
    plt.plot(range(20))
    plt.savefig('tmp1.jpg')
    plt.close()  # start the next plot on a fresh figure
    sourceHtml += '\n<p><img src="tmp1.jpg"></p>'

    # Add the dataframe
    sourceHtml += '\n<p>' + df.to_html() + '</p>'

    # Add another matplotlib figure
    plt.plot(range(70, 100))
    plt.savefig('tmp2.jpg')
    plt.close()
    sourceHtml += '\n<p><img src="tmp2.jpg"></p>'

    sourceHtml += '</body></html>'
    outputFilename = 'test.pdf'
    convertHtmlToPdf(sourceHtml, outputFilename)
Note: There seems to be a bug in xhtml2pdf at the time of writing which means that some CSS is not respected. Particularly pertinent to this question is that it seems impossible to get double borders around the table.
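If the report grows beyond a couple of figures, the repeated string concatenation in the example above could be factored into a small helper. The following is only a sketch: build_report_html and its (kind, value) tuple convention are hypothetical names of my own, not part of xhtml2pdf or pandas, and the table markup is a hand-written stand-in for df.to_html().

```python
def build_report_html(parts, css=""):
    """Assemble a full HTML document from an ordered list of parts.

    Each part is a (kind, value) tuple (a convention invented for this
    sketch): ("image", path) becomes an <img> tag, and ("html", markup)
    is inserted unchanged, e.g. the output of df.to_html().
    """
    body = []
    for kind, value in parts:
        if kind == "image":
            body.append('<p><img src="{}"></p>'.format(value))
        elif kind == "html":
            body.append(value)
        else:
            raise ValueError("unknown part kind: {!r}".format(kind))
    return (
        "<html><head><style>{}</style></head><body>\n".format(css)
        + "\n".join(body)
        + "\n</body></html>"
    )


# Stand-in for df.to_html(); any HTML table markup would do here.
table_html = "<table><tr><th>bar</th></tr><tr><td>0.0</td></tr></table>"

# Same page order as the example above: figure, table, figure.
report = build_report_html(
    [("image", "tmp1.jpg"), ("html", table_html), ("image", "tmp2.jpg")],
    css="td, th { padding: 5px; }",
)
```

The resulting string can be passed to convertHtmlToPdf in place of sourceHtml; the image files still have to exist on disk before the conversion runs.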
EDIT
In response to comments, it became obvious that some users (well, at least @Keith, who both answered and awarded a bounty!) want the table selectable, but definitely on a matplotlib axis. This is somewhat more in keeping with the original method. Hence, here is a method using the PDF backend for matplotlib and matplotlib objects only. I do not think the table looks as good, in particular the display of hierarchical column headers, but that's a matter of choice, I guess. I'm indebted to this answer and comments for the way to format axes for table display.
import numpy as np
import pandas as pd
from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.pyplot as plt

# Main program
if __name__ == '__main__':
    pp = PdfPages('Output.pdf')
    arrays = [np.hstack([['one']*3, ['two']*3]), ['Dog', 'Bird', 'Cat']*2]
    columns = pd.MultiIndex.from_arrays(arrays, names=['foo', 'bar'])
    df = pd.DataFrame(np.zeros((3, 6)), columns=columns,
                      index=pd.date_range('20000103', periods=3))

    plt.plot(range(20))
    pp.savefig()
    plt.close()

    # Calculate some sizes for formatting - constants are arbitrary - play around
    nrows, ncols = len(df) + 1, len(df.columns) + 10
    hcell, wcell = 0.3, 1.
    hpad, wpad = 0, 0

    # Put the table on a correctly sized figure
    fig = plt.figure(figsize=(ncols*wcell + wpad, nrows*hcell + hpad))
    plt.gca().axis('off')
    # pd.plotting.table replaces pd.tools.plotting.table from older pandas versions
    matplotlib_tab = pd.plotting.table(plt.gca(), df, loc='center')
    pp.savefig()
    plt.close()

    # Add another matplotlib figure
    plt.plot(range(70, 100))
    pp.savefig()
    plt.close()

    pp.close()
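The figure-size arithmetic in the matplotlib example can be pulled out into a small function, which makes it easier to play around with the constants. This is just a sketch: table_figsize and its extra_rows / extra_cols parameters are hypothetical names of my own, and the defaults simply mirror the arbitrary constants used above.

```python
def table_figsize(n_data_rows, n_data_cols, hcell=0.3, wcell=1.0,
                  hpad=0.0, wpad=0.0, extra_rows=1, extra_cols=10):
    """Return a (width, height) tuple in inches for a table-only figure.

    extra_rows and extra_cols are the arbitrary offsets from the example
    above (one extra row, ten extra columns); tune them to taste.
    """
    nrows = n_data_rows + extra_rows
    ncols = n_data_cols + extra_cols
    return (ncols * wcell + wpad, nrows * hcell + hpad)


# The example dataframe has 3 rows and 6 columns.
width, height = table_figsize(3, 6)
```

With the dataframe from the example, this would be used as plt.figure(figsize=table_figsize(len(df), len(df.columns))).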