4

I have : enter image description here

I have a PDF which are in two-column format.Is there a way to read each PDF according to the two-column format without cropping each PDF individually?

alfonso
  • 107
  • 1
  • 7

2 Answers2

4

I found an alternative method, you can crop the pdf with two part, left and right, then merge left content and right content for every page, you can try this:

# https://github.com/jsvine/pdfplumber

import pdfplumber


x0 = 0    # Distance of left side of character from left side of page.
x1 = 0.5  # Distance of right side of character from left side of page.
y0 = 0  # Distance of bottom of character from bottom of page.
y1 = 1  # Distance of top of character from bottom of page.

all_content = []
with pdfplumber.open("file_path") as pdf:
    for i, page in enumerate(pdf.pages):
        width = page.width
        height = page.height

        # Crop pages
        left_bbox = (x0*float(width), y0*float(height), x1*float(width), y1*float(height))
        page_crop = page.crop(bbox=left_bbox)
        left_text = page_crop.extract_text()

        left_bbox = (0.5*float(width), y0*float(height), 1*float(width), y1*float(height))
        page_crop = page.crop(bbox=left_bbox)
        right_text = page_crop.extract_text()
        page_context = '\n'.join([left_text, right_text])
        all_content.append(page_context)
        if i < 2:  # help you see the merged first two pages
            print(page_context)

fitz
  • 540
  • 4
  • 11
  • 1
    Some pages may or may not have text spit into columns. How can I write an if- statement based on this? @fitz – StressedBoi69420 Nov 30 '21 at 09:40
  • 1
    @StressedBoi69420 Do you mean the statement "if i < 2"? It is used for seeing the merged first two pages. Or you can provide more info – fitz Dec 01 '21 at 04:16
  • No, but that's useful to note. I'm talking about when to apply column extraction; **conditionally**. I've made a post about it here: https://stackoverflow.com/q/70170544/16105404 – StressedBoi69420 Dec 01 '21 at 09:06
0

This is code I use for regular pdf parsing, and it seems to work ok on that image (I downloaded an image, so this uses Optical Character Recognition, so its as accurate as regular OCR). Note that this tokenizes the text. Also note that you need to install tesseract for this to work (pytesseract just makes tesseract work from python). Tesseract is free & open source.

from PIL import Image
import pytesseract
import cv2
import os

def parse(image_path, threshold=False, blur=False):
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    if threshold:
        gray = cv2.threshold(gray, 0, 255, \
            cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
    if blur: #useful if salt-and-pepper background.
        gray = cv2.medianBlur(gray, 3)
    filename = "{}.png".format(os.getpid())
    cv2.imwrite(filename, gray) #Create a temp file
    text = pytesseract.image_to_string(Image.open(filename))
    os.remove(filename) #Remove the temp file
    text = text.split() #PROCESS HERE.
    print(text)
a = parse(image_path, True, False)
Evan Mata
  • 500
  • 1
  • 6
  • 19