How to extract text from a two-column PDF with Python?

Question

I have PDFs that are in two-column format. Is there a way to read each PDF according to the two-column layout without cropping each one individually?

What are your results so far? Apparently the pdf is in text format (NLP), not image (OCR). — Mika72, Mar 11 '19 at 10:49

fitz · Answer 1 · 2021-12-01T04:14:11.210

I found an alternative method: crop each page into two parts, left and right, then merge the left and right content for every page. You can try this:

# https://github.com/jsvine/pdfplumber

import pdfplumber


# Crop boundaries as fractions of the page size.
# pdfplumber bounding boxes are (x0, top, x1, bottom), with top and bottom
# measured from the top of the page.
x0 = 0    # Left edge of the left column.
x1 = 0.5  # Right edge of the left column (the page midline).
y0 = 0    # Top of the crop.
y1 = 1    # Bottom of the crop.

all_content = []
with pdfplumber.open("file_path") as pdf:  # replace "file_path" with the path to your PDF
    for i, page in enumerate(pdf.pages):
        width = float(page.width)
        height = float(page.height)

        # Crop the left half of the page and extract its text
        left_bbox = (x0 * width, y0 * height, x1 * width, y1 * height)
        left_text = page.crop(bbox=left_bbox).extract_text() or ""

        # Crop the right half of the page and extract its text
        right_bbox = (x1 * width, y0 * height, 1 * width, y1 * height)
        right_text = page.crop(bbox=right_bbox).extract_text() or ""

        # Left column first, then right column
        page_context = '\n'.join([left_text, right_text])
        all_content.append(page_context)
        if i < 2:  # help you see the merged first two pages
            print(page_context)
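
If the layout has more than two columns, the same cropping idea can be generalized. The following is only a sketch: `column_bboxes` is a hypothetical helper (not part of pdfplumber), and it assumes equal-width columns that span the full page height, which will not hold for every document.

```python
def column_bboxes(width, height, n_columns=2):
    """Return one (x0, top, x1, bottom) box per equal-width column, left to right.

    Hypothetical helper: assumes equal-width columns covering the full page height.
    """
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    width, height = float(width), float(height)
    # Column edges; the last edge is set to the exact page width so that
    # rounding error cannot push a box past the page boundary.
    edges = [width * i / n_columns for i in range(n_columns)] + [width]
    return [(edges[i], 0.0, edges[i + 1], height) for i in range(n_columns)]


# Possible use inside the page loop of the answer above:
#     texts = [page.crop(bbox).extract_text() or ""
#              for bbox in column_bboxes(page.width, page.height)]
#     page_context = "\n".join(texts)
```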

Some pages may or may not have text split into columns. How can I write an if-statement based on this? @fitz — StressedBoi69420, Nov 30 '21 at 09:40
@StressedBoi69420 Do you mean the statement "if i < 2"? It is only there to show the merged first two pages. Otherwise, could you provide more info? — fitz, Dec 01 '21 at 04:16
No, but that's useful to note. I'm talking about when to apply column extraction; **conditionally**. I've made a post about it here: https://stackoverflow.com/q/70170544/16105404 — StressedBoi69420, Dec 01 '21 at 09:06
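
For the conditional case raised in the comments above, one possible heuristic (a sketch, not something proposed in the thread) is to check how many text lines have a word overlapping a narrow band around the page centre. In a two-column layout that band is mostly empty gutter. `looks_two_column` and its thresholds are hypothetical and untuned; the function only assumes word dicts with `x0`, `x1` and `top` keys, which is the shape pdfplumber's `page.extract_words()` returns.

```python
def looks_two_column(words, page_width, gutter_ratio=0.02, max_crossing_lines=0.2):
    """Guess whether a page is laid out in two columns.

    words: dicts with "x0", "x1" and "top" keys.
    gutter_ratio and max_crossing_lines are illustrative defaults, not tuned values.
    """
    if not words:
        return False
    mid = page_width / 2
    half_band = page_width * gutter_ratio / 2
    band_left, band_right = mid - half_band, mid + half_band

    # Group words into lines by rounded vertical position and record whether
    # any word on that line overlaps the centre band.
    lines = {}
    for w in words:
        key = round(w["top"])
        overlaps = w["x0"] < band_right and w["x1"] > band_left
        lines[key] = lines.get(key, False) or overlaps

    has_left = any(w["x1"] <= band_left for w in words)
    has_right = any(w["x0"] >= band_right for w in words)
    crossing_fraction = sum(lines.values()) / len(lines)
    return has_left and has_right and crossing_fraction <= max_crossing_lines


# Possible use inside the page loop:
#     if looks_two_column(page.extract_words(), page.width):
#         ...crop left and right halves as in the answer above...
#     else:
#         text = page.extract_text() or ""
```

Full-width titles, tables and figures will confuse this check, so it is worth verifying on a few pages of your own documents before relying on it.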

score 0 · Answer 2 · answered Mar 11 '19 at 16:00

This is code I use for regular PDF parsing, and it seems to work OK on that image (I downloaded an image, so this uses Optical Character Recognition and is only as accurate as regular OCR). Note that this tokenizes the text. Also note that you need to install Tesseract for this to work (pytesseract just makes Tesseract usable from Python). Tesseract is free and open source.

from PIL import Image
import pytesseract
import cv2
import os

def parse(image_path, threshold=False, blur=False):
    image = cv2.imread(image_path)
    if image is None:  # cv2.imread returns None if the file cannot be read
        raise FileNotFoundError(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    if threshold:  # Binarize using Otsu's method
        gray = cv2.threshold(gray, 0, 255,
                             cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
    if blur:  # Useful if the background has salt-and-pepper noise.
        gray = cv2.medianBlur(gray, 3)
    filename = "{}.png".format(os.getpid())
    cv2.imwrite(filename, gray)  # Create a temp file
    try:
        with Image.open(filename) as img:
            text = pytesseract.image_to_string(img)
    finally:
        os.remove(filename)  # Remove the temp file
    text = text.split()  # Tokenize on whitespace. PROCESS HERE.
    print(text)
    return text

image_path = "page.png"  # placeholder: path to an image of the PDF page
a = parse(image_path, True, False)

Also, I may have borrowed that code from someone else a while back; I don't actually recall whether that specific snippet is mine or someone else's. — Evan Mata, Mar 11 '19 at 16:01
