
I have this PDF and I'm trying to extract a table from it. What is the best strategy to get the table? I'm not able to get the specific values in the table. For example, in the first table I need to get [70,75,80,85,90,100,105,110,115,120] for the first line and [0,0,2,6,10,10,10,2,2,0,0] for the second line.

My final result would be: 411924,KGDHN,MBELT W 40 INT, T.GG SUPREME/SELLERIA, 9643 BEIGE EBONY/COCOA, [70,75,80,85,90,100,105,110,115,120] ,[0,0,2,6,10,10,10,2,2,0,0],42,200.00,8,400.00

[screenshot of the PDF table]

import pdfplumber

with pdfplumber.open(doc) as pdf:
    print(pdf.pages)
    page = pdf.pages[0]
    im = page.to_image(resolution=400)
    text = page.extract_words()
    im = im.draw_rects(page.extract_words())
    im.show()
    # h = open('empty_test' + '.json', "w")
    # json.dump(text, h, indent=2, sort_keys=False)
    # h.close()

[screenshot of the page with rects drawn around the extracted words]

It is a PDF with text. I can extract the text easily and keep the layout almost the same:

for page in pdf.pages:
    for line in page.extract_text(keep_blank_chars=False, layout=True).splitlines():
        print(line)

[screenshot of the layout-preserving text output]

Herojos
  • Is the null a "no value" too? If not zero, maybe it is possible to get a corresponding null value? – Herojos Jun 29 '23 at 21:53
  • If you can isolate the yellow boxes you can split them in half, if there is nothing in the bottom half - assume null/zero. If you draw the `.rects` of the page - are they picked up? Or perhaps you can detect the areas by stroking_color. – jqurious Jun 30 '23 at 17:13
  • @jqurious How can I get the .rects drawn? – Herojos Jun 30 '23 at 22:02
  • https://github.com/jsvine/pdfplumber#creating-a-pageimage-with-to_image `im.draw_rects()` and `page.rects` - there's a guide in the docs with `.curves` but should help: https://nbviewer.org/github/jsvine/pdfplumber/blob/stable/examples/notebooks/ag-energy-roundup-curves.ipynb – jqurious Jun 30 '23 at 22:19
  • @jqurious, perfect, I've updated my question with the result that you asked for. – Herojos Jun 30 '23 at 22:28
  • Is this a PDF with text, or is this just a PDF of scanned images? The latter is infinitely more difficult to extract. – Tim Roberts Jun 30 '23 at 22:28
  • One possible approach is to use `page.search(r'\d+')` to find the numbers and use their `x0` position as `explicit_vertical_lines=` to build the table columns - you could crop out the Description/Size "column" first to narrow things down. There's many examples on the pdfplumber discussions e.g. https://github.com/jsvine/pdfplumber/discussions/907 - if the PDF is public and you can post it, that would help. – jqurious Jun 30 '23 at 22:37
  • @jqurious, I've deleted the private information, and the structure is preserved. At this URL: https://uploadnow.io/f/0NcV9jd you can get the file. Thanks in advance for your help. – Herojos Jun 30 '23 at 22:54

2 Answers


The idea is to isolate the smallest area around the values via cropping:

[screenshot of the cropped area around the values]

You can then use the x0 position of each word as your vertical line.

[screenshot showing a vertical line at each word's x0]

You can pass the lines to table settings via explicit_vertical_lines which will give back empty strings for the "blank" cells.
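For instance, assuming `values` is already a crop containing just the two number rows (the full code below shows how to derive it), a minimal sketch of that step would be:

# 'values' is a hypothetical crop holding only the size row and the quantity row
vlines = [word['x0'] for word in values.extract_words()]  # one divider per number

table = values.extract_table(dict(explicit_vertical_lines=vlines))
# table[0] -> sizes, table[1] -> quantities ('' where a cell is blank)

The sample output below comes from the complete script further down.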

col1.extract_text()='406831 DJ20N\n1000 NERO'
MBELT W.40 GG MAR DOLLAR PIGPRINT
['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
['', '', '0', '0', '2', '6', '10', '10', '10', '2', '2', '', '']
col3.extract_text()='42 218.00 9,15\n9,156.0'

col1.extract_text()='414516 0YA0G\n1000 BLACK'
MBELT W.30 GG MAR. PLUTONE CALF
['65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120', '135']
['0', '0', '0', '2', '6', '15', '15', '15', '2', '2', '0', '0', '']
col3.extract_text()='57 205.00 11,6\n11,685.0'

col1.extract_text()='406831 0YA0G\n1000 BLACK'
MBELT W.40 GG MAR PLUTONE CALF
['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
['', '', '', '', '', '2', '2', '2', '2', '2', '1', '', '']
col3.extract_text()='11 218.00 2,39\n2,398.0'

col1.extract_text()='627055 92TIN\n9769 B.EBONY/NERO'
MBELT W.37GG M.R T.GG SUPREME/PLUTONE CALF
['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
['', '', '', '', '', '3', '3', '3', '3', '3', '1', '', '']
col3.extract_text()='16 244.00 3,90\n3,904.0'

There are various ways you could approach this, but the steps I've used here are:

  • use the Description header line to identify the "top" of the rows, if present
  • split into rows based on the thick/dark horizontal lines
  • split each row into columns based on vertical lines within row
  • use the method above for creating a table from column 2

These are the horizontal lines we can use to split into rows:

[screenshot of the horizontal row-divider lines]

These are the vertical lines we can use to divide the columns:

[screenshot of the vertical column lines]
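To reproduce these debug images, something like the following should work (a rough sketch, assuming a hypothetical filename; the thresholds mirror the ones used in the code below):

import pdfplumber

with pdfplumber.open('order.pdf') as pdf:  # hypothetical filename
    page = pdf.pages[0]
    im = page.to_image(resolution=150)

    # thick, page-wide horizontal edges = row dividers
    im.draw_hlines(
        [e['top'] for e in page.horizontal_edges if e['width'] > page.width / 1.25],
        stroke='red', stroke_width=3,
    )

    # actual vertical line objects = column dividers
    im.draw_vlines(
        [e['x0'] for e in page.vertical_edges if e['object_type'] == 'line'],
        stroke='blue', stroke_width=3,
    )

    im.show()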

import itertools
import pdfplumber

pdf  = ...
page = ...

# First thick horizontal line that spans more than 80% of the page width
product_line = next(
    line for line in page.horizontal_edges
    if  line['orientation'] == 'h'
    and line['linewidth'] > 1
    and line['width'] > page.width / 1.25
)

# Search for Description 
description = page.search('Description/Size Quantity Qty Price Value')
has_description = len(description) > 0

# If there is a description we crop there, else we use the line divider
if has_description:
    description = description[0]
    product_area_top = description['bottom'] + 10
    product_area = page.crop(
        (product_line['x0'], product_area_top, product_line['x1'], page.height)
    )
else:
    product_area_top = product_line['top']
    product_area = page.crop(
        (product_line['x0'], product_area_top, product_line['x1'], page.height)
    )

# find horizontal lines for rows
hlines = [
    line['top'] for line in product_area.edges 
    if  line['orientation'] == 'h'
    and line['stroking_color'] == (0, 0, 0) 
    and line['width'] > product_area.width / 1.25
]

# If there is no description on the page we need to add in the top as the first line (in order to extract row 1)
if has_description is False:
    hlines = [product_area_top] + hlines

# Make sure our lines are sorted from top -> bottom
hlines = sorted(set(hlines))

for top, bottom in itertools.pairwise(hlines):
    row = product_area.crop(
        (product_area.bbox[0], top, product_area.width, bottom)
    )

    # vertical lines to create columns
    vlines = [
       line['x0'] for line in row.vertical_edges 
       if line['object_type'] == 'line'
    ]

    # we need to add an end line to extract last column
    vlines = sorted(vlines + [row.width])

    col1 = row.crop((vlines[0], top, vlines[1], bottom))
    col2 = row.crop((vlines[1], top, vlines[2], bottom))
    col3 = row.crop((vlines[2], top, vlines[3], bottom))
    
    lines = col2.extract_text_lines()

    # lines 1-2 are the values, use their positions to crop 
    bbox = lines[1]['x0'], lines[1]['top'], lines[-1]['x1'], lines[-2]['bottom']

    values = col2.crop(bbox)

    # use start of each word as a vertical line edge
    vlines = [word['x0'] for word in values.extract_words()]

    table = values.extract_table(dict(
       explicit_vertical_lines = vlines
    ))

    print(f'{col1.extract_text()=}')
    print(lines[0]['text'], table[0], table[1], sep='\n')
    print(f'{col3.extract_text()=}')

You may be able to extract the other values simply from the text extraction methods, or you could use a similar cropping technique on col1, col3.
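For example, a rough sketch of splitting those texts into fields, assuming the layout seen in the sample output above (real pages may differ):

# Hypothetical field-splitting based on the sample output, e.g.
#   col1 -> '406831 DJ20N\n1000 NERO'
#   col3 -> '42 218.00 9,15\n9,156.0'
style_line, color_line = col1.extract_text().split('\n')
style, material_code = style_line.split(' ', 1)
color_code, color_name = color_line.split(' ', 1)

qty_line, value_line = col3.extract_text().split('\n')
qty, price, _ = qty_line.split(' ', 2)  # third token is the value, truncated by the crop
value = value_line                      # the full value appears on the second line

print(style, material_code, color_code, color_name, qty, price, value)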

jqurious
  • Wonderful, you opened up a world to me. I am not familiar with this library, but I find that it is very powerful... if you know how to use it. Currently I have some difficulty understanding the strategy well; I have to study your code, which works perfectly. However, I found that the following pages don't have a fixed reference to crop with the bounding-box coordinates. How can I fix the coordinates? See the page turn in this image https://uploadnow.io/f/JjNftrB – Herojos Jul 01 '23 at 10:26
  • If there is no description on the page you could fallback to using the first horizontal divider line. Or if there is always some other pattern present e.g. the date: you could `page.search(r'\d{2}/\d{2}/\d{4}')` and use the bottom position of the first date match perhaps. – jqurious Jul 01 '23 at 11:16
  • Hi, I'm trying to implement your wonderful solution to recover pages without a fixed header as well. If I use `desc = page.search(r'P.P.O. N°:.+\d{2}/\d{2}/\d{4}')[0]` and `qty = page.search(r'77-NUGNES 1920 S.R.L.')[0]` to crop `col2 = products.crop((desc['x0']-50, desc['top'], qty['x0'], products.bbox[-1]))`, I get this error: raise ValueError(f"{bbox} has a negative width or height.") ValueError: (131.0914, 366.4936, 458.6376375, 358.6801) has a negative width or height. – Herojos Jul 01 '23 at 14:57
  • Maybe if you can post another sample PDF with no header I can try to see what the problem is. – jqurious Jul 01 '23 at 17:20
  • https://uploadnow.io/f/d60pyWT . I don't know how to thank you. – Herojos Jul 01 '23 at 18:06
  • I see what you mean - the problem was no way to identify the columns. I've changed the code to use the vertical column lines in the page instead of using the description text. It works on both pages of the latest sample PDF. – jqurious Jul 01 '23 at 20:02
  • It works perfectly. I will study your setup and strategy. Congratulations, really; you should publish a guide on how to use the library and, above all, the transformation strategy. I just have to retrieve the other information. I hope it's simpler. Thanks again, – Herojos Jul 01 '23 at 20:36
  • I've updated the code to instead crop the columns inside the loop. This makes more sense and should give you easier access to the other values. – jqurious Jul 01 '23 at 21:23

Here's another possible approach which may be simpler.

You can match just those specific numbers and save the x0 position so you have the full "width".

[screenshot of the matched numbers and their x0 positions]

You can then create a full "blank" row which you can merge with each row to ensure each row gets the same number of columns.

import itertools
import pdfplumber
import re
from   operator import itemgetter

pdf  = ...
page = ...

nums = [
   word for word in page.extract_words(extra_attrs=['size', 'fontname'])
   if  re.fullmatch(r'\d+', word['text'])
   and word['size'] == 6.75
   and word['fontname'].endswith('Bold') is False
]

# remove duplicates based on x0
cols = {}
for num in nums:
    cols[num['x0']] = num
 
# replace text with blank   
blanks = { 
   col['x0']: col | {'text': ''} for col in cols.values() 
}

# use 'top' position to group into rows
for top, row in itertools.groupby(nums, key=itemgetter('top')):
    row = blanks | { col['x0']: col for col in row }
    print([col['text'] for col in row.values()])

['70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120', '', '']
['', '', '2', '6', '10', '10', '10', '2', '2', '', '', '', '']
['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
['', '', '0', '0', '2', '6', '10', '10', '10', '2', '2', '', '']
['65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120', '135']
['0', '0', '0', '2', '6', '15', '15', '15', '2', '2', '0', '0', '']
['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
['', '', '', '', '', '2', '2', '2', '2', '2', '1', '', '']
['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
['', '', '', '', '', '3', '3', '3', '3', '3', '1', '', '']
jqurious