I am working on a project with the goal of extracting structured data from a series of tables captured in images.

I have had some success adapting the process outlined in this extremely helpful Medium post.

As best I understand, this program works by creating a contour mask, of sorts, to outline the borders of a table. Here is the relevant code performing that function:

    import cv2
    import numpy as np

    # Convert the already-loaded image to a NumPy array.
    # Otsu thresholding needs a single-channel 8-bit (grayscale) image.
    img = np.array(img)

    # Threshold to a binary image (with THRESH_OTSU the 128 is ignored
    # and the threshold value is chosen automatically)
    thresh, img_bin = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)

    # Invert the image so that dark lines and text become white on black
    img_bin = 255 - img_bin

    # Kernel length: one hundredth of the image width
    kernel_len = img.shape[1] // 100

    # Vertical kernel (1 px wide, kernel_len tall) to detect vertical lines
    ver_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, kernel_len))

    # Horizontal kernel (kernel_len wide, 1 px tall) to detect horizontal lines
    hor_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_len, 1))

    # A 2x2 kernel for the final clean-up erosion
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))

    # Erode then dilate with the vertical kernel to keep only vertical lines
    image_1 = cv2.erode(img_bin, ver_kernel, iterations=3)
    vertical_lines = cv2.dilate(image_1, ver_kernel, iterations=3)

    # Erode then dilate with the horizontal kernel to keep only horizontal lines
    image_2 = cv2.erode(img_bin, hor_kernel, iterations=3)
    horizontal_lines = cv2.dilate(image_2, hor_kernel, iterations=3)

    # Combine horizontal and vertical lines in a third image, with equal weights
    img_vh = cv2.addWeighted(vertical_lines, 0.5, horizontal_lines, 0.5, 0.0)

    # Invert, erode, and threshold the combined image
    # (the result has black lines on a white background)
    img_vh = cv2.erode(~img_vh, kernel, iterations=2)
    thresh, img_vh = cv2.threshold(img_vh, 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)

This process produces a NumPy array that can be interpreted as an image like this:

[Image: the contour mask produced by the code above]

From there, the program can identify the table cells outlined on four sides by the contour mask.

Unfortunately, many of the tables that I seek to process, including the one above, lack perfect border formatting. The left-most column above lacks a left border (there is still data inside it). Other tables I have lack internal borders entirely, relying on white space to format the data for the human eye.

[Image: an example table without internal borders]

As best I can tell, my path forward here is to add the missing contour lines myself using some kind of logic based on visual elements on the page. In the first example, I could attempt to add a left-side vertical line to the contour mask based on the position of the other contours. In the second example, I could try to add table borders based on consistencies in the position of the text.
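For the first case, one rough way to add the missing line is to close the outer frame along the bounding box of the lines that were detected. This is only a sketch: it assumes black lines on a white mask, assumes the horizontal rules already reach the table's true left and right edges, and `close_outer_border` and `thickness` are hypothetical names.

```python
import numpy as np


def close_outer_border(mask, thickness=2):
    """Return a copy of `mask` with a frame drawn around all line pixels.

    Assumes `mask` is 2-D with lines = 0 and background = 255.
    """
    out = mask.copy()
    ys, xs = np.nonzero(mask == 0)
    if ys.size == 0:
        return out  # no lines detected, nothing to close
    top, bottom = int(ys.min()), int(ys.max())
    left, right = int(xs.min()), int(xs.max())
    t = thickness
    out[top:top + t, left:right + 1] = 0                       # top edge
    out[max(bottom - t + 1, 0):bottom + 1, left:right + 1] = 0  # bottom edge
    out[top:bottom + 1, left:left + t] = 0                     # left edge
    out[top:bottom + 1, max(right - t + 1, 0):right + 1] = 0    # right edge
    return out
```

Edges that already exist are simply redrawn, so the function is safe to apply to tables whose frame is complete.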

That being said, this strategy would require a significant amount of logic, and may not be flexible enough to deal with the various table formats I may come into contact with.

Am I approaching this challenge with the right strategy? Is there a deployable software solution that I am not seeing? Ideally, I'd like this to be as automated as possible.

Any help would be greatly appreciated!

abrezey
  • Threshold so you have white and black. Then use np.count_nonzero() for each row. If the count is the width of the image, then you have found the gaps and the center of the gaps are the row separators. (Or just use the first row of the gap for the separator). Similarly, do the same along the columns. – fmw42 Mar 31 '21 at 15:45
  • cv2.HoughLinesP() is better to detect table grid. https://stackoverflow.com/questions/45322630/how-to-detect-lines-in-opencv – Soichi Jan 16 '22 at 02:09
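The whitespace-gap idea from the first comment could be sketched roughly as follows. It assumes a thresholded image with a white (nonzero) background and black (0) text; `gap_centers` is a hypothetical helper name, and it also reports the page margins as gaps.

```python
import numpy as np


def gap_centers(binary, axis):
    """Centers of fully white runs of rows (axis=1) or columns (axis=0).

    Assumes `binary` is 2-D with background nonzero and ink = 0.
    """
    length = binary.shape[axis]  # pixels per row (axis=1) or per column (axis=0)
    # A row/column is blank when every pixel in it is nonzero (white)
    blank = np.count_nonzero(binary, axis=axis) == length
    # Pad with False so every blank run has both a start and an end edge
    padded = np.concatenate(([False], blank, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = edges[0::2], edges[1::2]  # each run covers [start, end)
    return [int((s + e - 1) // 2) for s, e in zip(starts, ends)]
```

Calling `gap_centers(img_bin, axis=1)` gives candidate row separators and `gap_centers(img_bin, axis=0)` gives candidate column separators. On noisy scans a single stray pixel breaks a gap, so a tolerance on the count would probably be needed.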
