A PDF just contains instructions to place a character at an x,y coordinate on a 2-D plane, retaining no knowledge of words, sentences or tables.
Camelot uses PDFMiner under the hood to group characters into words and words into sentences. Sometimes when the characters are too close, PDFMiner can group characters belonging to different words into a single one.
Since the characters in your PDF table are placed very close, they are being merged into a single word and hence Camelot isn't able to detect the columns correctly. You can specify column separators to get the table out in this case. To get the x-coordinates of column separators you can check out the visual debugging guide. Additionally, you can specify split_text=True
to cut the word along the column separators you've specified. Here's the code (I got the x-coordinates by creating a matplotlib plot of the text in the PDF using $ camelot stream -plot text m27.pdf
):
Using CLI:
$ camelot --output m27.csv --format csv -split stream -C 72,95,209,327,442,529,566,606,683 m27.pdf
Using API:
>>> import camelot
>>> tables = camelot.read_pdf('m27.pdf', flavor='stream', columns=['72,95,209,327,442,529,566,606,683'], split_text=True)