extract borderless table with pdfplumber

Question

I am trying to extract borderless tables from a PDF document. I have tried a few combinations of the table_settings parameters, but pdfplumber does not recognize the borderless tables correctly.

The PDF file can be downloaded from the link.

Here is my code

import pdfplumber

pdf_file = "pdffile"
with pdfplumber.open(pdf_file) as pdf:
    for i in range(0, len(pdf.pages)):
        try:
            if i == 7:
                bold_title_text = pdf.pages[i]
                ff = bold_title_text.extract_table(table_settings={
                    "vertical_strategy": "text",
                    "horizontal_strategy": "lines",
                    "keep_blank_chars": True,  # boolean, not the string "True"
                    "snap_tolerance": 4,
                })
                # display() is available in Jupyter/IPython
                display(ff[1])
        except IndexError:
            print("")
            break

Output: ['Element','nt Attribute Size Input Type Requirement']

Expected output: ['Element', 'Attribute', 'Size', 'Input Type', 'Requirement']

You can try camelot with `stream` flavor: https://camelot-py.readthedocs.io/en/master/user/how-it-works.html#stream — Stefano Fiorucci - anakin87, Jul 07 '22 at 10:48

score 0 · Answer 1 · answered Sep 02 '23 at 12:04

For tables that have no vertical line separators, you can do one of the following:

1. Crop the table part of the page first, and then
   1. Use the "text" strategy like you have in your question. Without the crop, it doesn't work well because the non-table text interferes with the table extraction logic.
   2. Use the "explicit" strategy for the vertical lines and specify the X-coordinates for the vertical lines.
2. Use the "explicit" strategy for the vertical lines and specify the X-coordinates for the vertical lines, without cropping. Since the page is not cropped, you need post-processing logic to filter out the non-table data.
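The cropping approach could look roughly like this. It is only a sketch, not something verified against the PDF in the question: the helper name `extract_cropped_table`, the file name and the bounding box in the usage comment are made-up placeholders. It relies on `page.crop()`, which takes an `(x0, top, x1, bottom)` box, and on `extract_table()` with `table_settings`.

```python
def extract_cropped_table(page, bbox, vertical_lines=None):
    """Crop `page` to `bbox` and extract one table from the cropped region.

    `page` is a pdfplumber page; `bbox` is (x0, top, x1, bottom).
    If `vertical_lines` is given, those X-coordinates are used as column
    boundaries ("explicit" strategy); otherwise columns are inferred from
    text alignment ("text" strategy).
    """
    settings = {"horizontal_strategy": "lines"}
    if vertical_lines is None:
        settings["vertical_strategy"] = "text"
    else:
        settings["vertical_strategy"] = "explicit"
        settings["explicit_vertical_lines"] = list(vertical_lines)
    cropped = page.crop(bbox)
    return cropped.extract_table(table_settings=settings)


# Usage (hypothetical file name and coordinates; measure the real box first):
# import pdfplumber
# with pdfplumber.open("file.pdf") as pdf:
#     table = extract_cropped_table(pdf.pages[6], (80, 100, 520, 400))
```

The box has to be measured for each document, so the numbers in the usage comment should not be expected to fit any particular table.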

Here is an example for the explicit lines that works with the table you've shared

import pdfplumber

with pdfplumber.open("file.pdf") as pdf:
    page = pdf.pages[6]
    # Column boundaries are given explicitly as X-coordinates
    tables = page.extract_tables(table_settings={
        "vertical_strategy": "explicit",
        "horizontal_strategy": "lines",
        "explicit_vertical_lines": [90, 200, 250, 320, 440, 510],
    })

for table in tables:
    print()
    for row in table:
        print(row)

With this your table output becomes

['Element', 'Attribute', 'Size', 'Input Type', 'Requirement']
['TransmittingCountry', '', '2-character', 'iso:CountryCode_Type', 'Validation']
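If you skip the crop, the post-processing to filter out non-table data could be something like the sketch below. The function name `keep_table_rows`, the `min_filled` threshold and the first and last sample rows are invented for illustration; only the header and the TransmittingCountry row come from the output above. It assumes empty cells may come back as either `None` or an empty string.

```python
def keep_table_rows(rows, header, min_filled=2):
    """Return the rows from the header row onward that look like table data."""
    # Normalize cells: treat None as empty and strip surrounding whitespace
    normalized = [[(cell or "").strip() for cell in row] for row in rows]
    try:
        start = normalized.index(list(header))
    except ValueError:
        return []  # header not found, so nothing is treated as table data
    kept = []
    for row in normalized[start:]:
        # Keep rows with at least `min_filled` non-empty cells
        if sum(1 for cell in row if cell) >= min_filled:
            kept.append(row)
    return kept


header = ["Element", "Attribute", "Size", "Input Type", "Requirement"]
rows = [
    ["Some paragraph text above the table", None, None, None, None],
    ["Element", "Attribute", "Size", "Input Type", "Requirement"],
    ["TransmittingCountry", "", "2-character", "iso:CountryCode_Type", "Validation"],
    ["Page footer", "", "", "", ""],
]
for row in keep_table_rows(rows, header):
    print(row)
```

Matching on the known header row is a simple heuristic; rows that legitimately have a single filled cell would be dropped, so adjust `min_filled` to your data.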
