I have PDFs in two-column format. Is there a way to read each PDF column by column, without cropping each PDF individually?
Viewed 5,934 times
2 Answers
4
I found an alternative method: crop each page into a left half and a right half, extract the text from each half, and then join the left text and the right text for every page. You can try this:
# https://github.com/jsvine/pdfplumber
import pdfplumber

# Crop bounds as fractions of the page size. pdfplumber bounding boxes are
# (x0, top, x1, bottom), with top and bottom measured from the top of the page.
x0 = 0    # left edge of the left column
x1 = 0.5  # right edge of the left column (the page midline)
y0 = 0    # top of the page
y1 = 1    # bottom of the page

all_content = []
with pdfplumber.open("file_path") as pdf:
    for i, page in enumerate(pdf.pages):
        width = float(page.width)
        height = float(page.height)

        # Crop the left half and extract its text
        left_bbox = (x0 * width, y0 * height, x1 * width, y1 * height)
        left_text = page.crop(bbox=left_bbox).extract_text() or ""

        # Crop the right half and extract its text
        right_bbox = (x1 * width, y0 * height, 1 * width, y1 * height)
        right_text = page.crop(bbox=right_bbox).extract_text() or ""

        # Read the left column first, then the right column
        page_context = '\n'.join([left_text, right_text])
        all_content.append(page_context)
        if i < 2:  # print the merged text of the first two pages as a check
            print(page_context)
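
If you need more than two columns, the bounding-box arithmetic above can be pulled into a small helper. This is only a sketch: `column_bboxes` is a made-up name, and it assumes equal-width columns that span the full page height, which will not hold for every layout.

```python
def column_bboxes(width, height, n_columns=2):
    """Return one (x0, top, x1, bottom) box per column, left to right.

    Hypothetical helper: assumes equal-width columns spanning the
    full page height.
    """
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    col_width = float(width) / n_columns
    return [
        (i * col_width, 0.0, (i + 1) * col_width, float(height))
        for i in range(n_columns)
    ]


# A 612 x 792 point page split into two columns
boxes = column_bboxes(612, 792)
print(boxes)  # [(0.0, 0.0, 306.0, 792.0), (306.0, 0.0, 612.0, 792.0)]
```

Each tuple is in the (x0, top, x1, bottom) order that pdfplumber's `page.crop` expects, so inside the page loop you could crop with each box in turn and join the extracted texts in order.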

fitz
-
Some pages may or may not have text split into columns. How can I write an if statement based on this? @fitz – StressedBoi69420 Nov 30 '21 at 09:40
-
@StressedBoi69420 Do you mean the statement "if i < 2"? It only prints the merged text of the first two pages so you can check it. If you mean something else, can you provide more info? – fitz Dec 01 '21 at 04:16
-
No, but that's useful to note. I'm talking about applying column extraction **conditionally**, only on the pages that need it. I've made a post about it here: https://stackoverflow.com/q/70170544/16105404 – StressedBoi69420 Dec 01 '21 at 09:06
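On applying column extraction conditionally, one possible heuristic (not something from the answer above, just a sketch) is to check whether almost no words overlap a narrow vertical band around the page midline. The function name `looks_two_column` and both thresholds are assumptions picked for illustration; they would need tuning on real documents. The input is a list of dicts with `x0` and `x1` keys, which is the shape pdfplumber's `page.extract_words()` returns.

```python
def looks_two_column(words, page_width, gutter_frac=0.02, max_crossing=0.02):
    """Guess whether a page is laid out in two columns.

    words: dicts with "x0" and "x1" keys (left and right edge of each word).
    gutter_frac and max_crossing are illustrative thresholds, not tuned values.
    """
    if not words:
        return False
    mid = float(page_width) / 2.0
    half_band = float(page_width) * gutter_frac / 2.0
    lo, hi = mid - half_band, mid + half_band

    # Words overlapping the band around the midline
    crossing = sum(1 for w in words if w["x0"] < hi and w["x1"] > lo)
    # Words lying entirely to the left or right of the band
    left = sum(1 for w in words if w["x1"] <= lo)
    right = sum(1 for w in words if w["x0"] >= hi)

    return left > 0 and right > 0 and crossing / len(words) <= max_crossing


# Synthetic example on a 600-point-wide page
two_col = [{"x0": 50, "x1": 100}, {"x0": 200, "x1": 280},
           {"x0": 320, "x1": 400}, {"x0": 500, "x1": 560}]
one_col = [{"x0": 50, "x1": 100}, {"x0": 280, "x1": 330},
           {"x0": 400, "x1": 450}]
print(looks_two_column(two_col, 600))  # True
print(looks_two_column(one_col, 600))  # False
```

With pdfplumber you would call something like `looks_two_column(page.extract_words(), page.width)` and only crop into halves when it returns True. Full-width titles or figures can fool a check like this, so treat the result as a hint rather than a guarantee.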
0
This is the code I use for regular PDF parsing, and it seems to work OK on that image. (I downloaded an image, so this uses optical character recognition (OCR) and is only as accurate as regular OCR.) Note that this tokenizes the text. Also note that you need to install Tesseract for this to work (pytesseract just lets you call Tesseract from Python). Tesseract is free and open source.
from PIL import Image
import pytesseract
import cv2
import os

def parse(image_path, threshold=False, blur=False):
    # Load the image and convert it to grayscale
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    if threshold:
        # Binarize with Otsu's automatically chosen threshold
        gray = cv2.threshold(gray, 0, 255,
                             cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
    if blur:  # useful if the background has salt-and-pepper noise
        gray = cv2.medianBlur(gray, 3)
    filename = "{}.png".format(os.getpid())
    cv2.imwrite(filename, gray)  # create a temp file
    try:
        with Image.open(filename) as img:
            text = pytesseract.image_to_string(img)
    finally:
        os.remove(filename)  # remove the temp file
    text = text.split()  # PROCESS HERE: split into whitespace-separated tokens
    print(text)
    return text

image_path = "page.png"  # placeholder: path to an image of the page
a = parse(image_path, True, False)

Evan Mata
-
Also, I may have borrowed that code from someone else a while back; I don't actually recall whether that specific snippet is mine or someone else's. – Evan Mata Mar 11 '19 at 16:01