
With the pdfplumber library, you can extract either the text of a PDF page or the tables from a PDF page.

The issue is that I can't seem to find a way to extract text and tables together, in the order they appear. Essentially, if the PDF is formatted in this way:

text 1
table name
___________
| Header 1 |
------------
| row 1    |
------------

text 2

I would like the output to be:

["text 1",
 "table name",
 [["header 1"], ["row 1"]],
 "text 2"]

In this example you could run extract_text from pdfplumber:

import pdfplumber

with pdfplumber.open("example.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()  # table contents come back as plain text too

but that extracts text and tables as text. You could run extract_tables, but that only gives you the tables. I need a way to extract both text and tables at the same time.

Is this built into the library in some way that I don't understand? If not, is this possible?

Edit: Answered

This comes directly from the accepted answer with a slight tweak to fix it. Thanks so much!

from operator import itemgetter

import pdfplumber

def check_bboxes(word, table_bbox):
    """
    Check whether word is inside a table bbox.
    """
    l = word['x0'], word['top'], word['x1'], word['bottom']
    r = table_bbox
    return l[0] > r[0] and l[1] > r[1] and l[2] < r[2] and l[3] < r[3]


# `page` is a single pdfplumber page, e.g. one item of pdf.pages
tables = page.find_tables()
table_bboxes = [i.bbox for i in tables]
tables = [{'table': i.extract(), 'top': i.bbox[1]} for i in tables]
non_table_words = [word for word in page.extract_words() if not any(
    [check_bboxes(word, table_bbox) for table_bbox in table_bboxes])]
lines = []
for cluster in pdfplumber.utils.cluster_objects(
        non_table_words + tables, itemgetter('top'), tolerance=5):
    if 'text' in cluster[0]:
        lines.append(' '.join([i['text'] for i in cluster]))
    elif 'table' in cluster[0]:
        lines.append(cluster[0]['table'])

Edit July 19th 2022:

Updated a parameter to use itemgetter, which pdfplumber's cluster_objects function now requires (rather than a string).
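
To see the ordering idea without needing a PDF, here is a minimal sketch on hand-made data. The word and table dictionaries are made-up stand-ins for what page.extract_words() and page.find_tables() would give you, and group_by_top is a hypothetical, simplified replacement for pdfplumber.utils.cluster_objects, not the library's implementation.

```python
from operator import itemgetter


def group_by_top(objs, tolerance=5):
    """Group objects whose 'top' is within `tolerance` of the previous object.

    Simplified stand-in for pdfplumber.utils.cluster_objects (an assumption,
    not the library's code).
    """
    groups = []
    for obj in sorted(objs, key=itemgetter('top')):
        if groups and obj['top'] - groups[-1][-1]['top'] <= tolerance:
            groups[-1].append(obj)
        else:
            groups.append([obj])
    return groups


# Made-up words left over after removing everything inside table bboxes.
non_table_words = [
    {'text': 'text', 'top': 10},
    {'text': '1', 'top': 10},
    {'text': 'table', 'top': 30},
    {'text': 'name', 'top': 30},
    {'text': 'text', 'top': 120},
    {'text': '2', 'top': 120},
]
# Made-up table entry: extracted rows plus the top edge of its bbox.
tables = [{'table': [['header 1'], ['row 1']], 'top': 50}]

lines = []
for cluster in group_by_top(non_table_words + tables):
    if 'text' in cluster[0]:
        lines.append(' '.join(w['text'] for w in cluster))
    elif 'table' in cluster[0]:
        lines.append(cluster[0]['table'])

print(lines)
```

Because words and tables share the same 'top' key, a single sort places each table between the text above and below it, which is the shape asked for in the question.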

Justin Furuness
    For anyone coming here in the future, it's worth noting that this method only works well when there is no text to the left and right of the tables. – Justin Furuness Apr 22 '22 at 04:49
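
A small sketch of why text beside a table is a problem, using made-up coordinates (the bbox and word values are illustrative, not from a real PDF). A word to the right of a table lies outside the table's bbox, so it survives the filter, yet its top falls inside the table's vertical span. Sorting by top would then place it after the table as a separate line rather than next to the row it belongs with.

```python
def check_bboxes(word, table_bbox):
    """Check whether word is strictly inside a table bbox (x0, top, x1, bottom)."""
    l = word['x0'], word['top'], word['x1'], word['bottom']
    r = table_bbox
    return l[0] > r[0] and l[1] > r[1] and l[2] < r[2] and l[3] < r[3]


table_bbox = (50, 100, 300, 200)  # made-up table: x0, top, x1, bottom

# One word inside a table cell, one word to the right of the table.
inside_word = {'text': 'cell', 'x0': 60, 'top': 120, 'x1': 90, 'bottom': 130}
side_word = {'text': 'note', 'x0': 320, 'top': 120, 'x1': 350, 'bottom': 130}

inside_kept = not check_bboxes(inside_word, table_bbox)
side_kept = not check_bboxes(side_word, table_bbox)

# The side word is kept, yet its top lies between the table's top and bottom.
side_overlaps_rows = table_bbox[1] < side_word['top'] < table_bbox[3]

print(inside_kept, side_kept, side_overlaps_rows)
```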

1 Answer

You can get the tables' bounding boxes and then filter out all of the words inside them, something like this:

def check_bboxes(word, table_bbox):
    """
    Check whether word is inside a table bbox.
    """
    l = word['x0'], word['top'], word['x1'], word['bottom']
    r = table_bbox
    return l[0] > r[0] and l[1] > r[1] and l[2] < r[2] and l[3] < r[3]


tables = page.find_tables()
table_bboxes = [i.bbox for i in tables]
tables = [{'table': i.extract(), 'doctop': i.bbox[1]} for i in tables]
non_table_words = [word for word in page.extract_words() if not any(
    [check_bboxes(word, table_bbox) for table_bbox in table_bboxes])]
lines = []
for cluster in pdfplumber.utils.cluster_objects(non_table_words+tables, 'doctop', tolerance=5):
    if 'text' in cluster[0]:
        lines.append(' '.join([i['text'] for i in cluster]))
    elif 'table' in cluster[0]:
        lines.append(cluster[0]['table'])
hellpanderr
  • Thank you for your answer, unfortunately it's not quite what I was looking for. This just removes all words that are on tables, and does not return lines of text with tables at the same time, in the same list, in order (as my example shows) – Justin Furuness Apr 05 '22 at 17:07
  • I just tried it; this answer appears to put all tables first, no matter what (rather than in order with the text). If I can figure out how to make them appear in order I'll accept the answer. – Justin Furuness Apr 07 '22 at 17:03
  • I think I figured it out, it's because you were using the doctop attribute of the text, but the top attribute of the tables, which don't match up. I'll update my question with the correct answer, then accept your answer. Thank you so much, this is amazing! – Justin Furuness Apr 07 '22 at 17:11
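
The mismatch described in the last comment can be illustrated with plain numbers. This sketch assumes that doctop is measured from the top of the whole document while top is measured from the top of the current page, and it uses a made-up page height of 792. The dictionaries are hand-written, not pdfplumber output.

```python
from operator import itemgetter

PAGE_HEIGHT = 792  # made-up page height in points
page_index = 1     # second page, zero-based

# A word 100 points down the second page, with its document-wide offset.
word = {'text': 'text 1', 'top': 100,
        'doctop': page_index * PAGE_HEIGHT + 100}

# A table whose bbox top is 200 points down the same page.
table_top = 200
# Page-relative value stored under the 'doctop' key, as in the answer's code.
table_mislabelled = {'table': [['header 1']], 'doctop': table_top}
# Page-relative value stored under 'top', as in the corrected code.
table_fixed = {'table': [['header 1']], 'top': table_top}

# Mixing the two reference points sorts the table ahead of the word above it.
wrong_order = sorted([word, table_mislabelled], key=itemgetter('doctop'))
# Using page-relative 'top' for both restores reading order.
right_order = sorted([word, table_fixed], key=itemgetter('top'))

print('table' in wrong_order[0], 'text' in right_order[0])
```

Under that assumption the two values coincide on the first page, so the ordering problem would only show up on later pages.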