
Team,

I have a PDF file with 6000+ pages. What's the fastest method I can use to extract the text?

I am using this code:

import pdfplumber

all_text = ""
with pdfplumber.open(pdf_dir) as pdf:  # pdf_dir holds the path to the PDF file
    for page in pdf.pages:
        text = page.extract_text()
        all_text += text or ""  # extract_text() may return None on pages with no text

but it's taking a very long time to complete.

Also, after extracting the text, I need to search for the address lines, for which I am using this code:

import re

address_line = re.compile(r'(:  \d{5})')
for line in all_text.split('\n'):
    if address_line.search(line):
        print(line)

appreciate your help in advance :)

Maki

2 Answers


Since you don't need to keep the whole text in memory, just iterate through each page's lines and collect the matching lines:

import re
import pdfplumber

with pdfplumber.open(pdf_dir) as pdf:
    matched_lines = []
    address_line = re.compile(r'(:  \d{5})')
    for page in pdf.pages:
        text = page.extract_text() or ''  # extract_text() may return None on empty pages
        for line in text.split('\n'):
            if address_line.search(line):
                matched_lines.append(line)
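
For a 6000-page file you may also want to stream the matches straight to disk instead of holding them in a list; here is a minimal sketch of that variant (matches.txt is just an assumed output name):

import re
import pdfplumber

address_line = re.compile(r'(:  \d{5})')
with pdfplumber.open(pdf_dir) as pdf, open('matches.txt', 'w') as out:
    for page in pdf.pages:
        text = page.extract_text() or ''  # guard against pages with no extractable text
        for line in text.split('\n'):
            if address_line.search(line):
                out.write(line + '\n')

Note that this only reduces memory use; extract_text() itself is still the slow part.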
RomanPerekhrest
  • this combines the code, thanks, but I'm still having issues with the time to complete the task; I've been at it for 20+ minutes and am still waiting – Maki Jan 10 '23 at 10:07

You may find multiprocessing more efficient. Here's an example of how that could be done:

import pdfplumber
from re import compile
from sys import stderr
from concurrent.futures import ProcessPoolExecutor as PPE
from functools import partial

FILENAME = 'Maki.pdf'
PATTERN = compile(r'(:  \d{5})')

# Return a list of all lines on one page that match the regular expression.
# Each worker process re-opens the PDF itself, so no pdfplumber objects
# have to be serialised and sent between processes.
def extract(filename, page):
    result = []
    try:
        with pdfplumber.open(filename) as pdf:
            for line in pdf.pages[page].extract_text().split('\n'):
                if PATTERN.search(line):
                    result.append(line)
    except Exception as e:
        print(e, file=stderr)
    return result

def main(filename):
    # The parent opens the PDF only to discover the page count; the workers
    # receive just the filename and a page number.
    with PPE() as ppe, pdfplumber.open(filename) as pdf:
        for lines in ppe.map(partial(extract, filename), range(len(pdf.pages))):
            print(lines)

if __name__ == '__main__':
    main(FILENAME)

Notes:

Rewritten to avoid serialisation: each worker receives only the filename and a page index and re-opens the PDF itself, so no pdfplumber objects need to be pickled and shipped between processes (see the comment below on how costly that can be). If re-opening the file once per page proves expensive, a chunked variant is sketched below.
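
A rough, untested sketch of such a chunked variant, reusing the imports, FILENAME and PATTERN from the code above (extract_range, main_chunked and CHUNK = 50 are illustrative names and an arbitrary chunk size):

CHUNK = 50  # pages per worker task; an arbitrary tuning value

# Each worker opens the PDF once and scans a contiguous range of pages,
# so the cost of re-opening the file is amortised over the whole chunk.
def extract_range(filename, start, stop):
    result = []
    try:
        with pdfplumber.open(filename) as pdf:
            for page in pdf.pages[start:stop]:
                text = page.extract_text() or ''
                for line in text.split('\n'):
                    if PATTERN.search(line):
                        result.append(line)
    except Exception as e:
        print(e, file=stderr)
    return result

def main_chunked(filename):
    # Open the PDF in the parent only to learn the page count.
    with pdfplumber.open(filename) as pdf:
        n = len(pdf.pages)
    starts = range(0, n, CHUNK)
    stops = [min(s + CHUNK, n) for s in starts]
    with PPE() as ppe:
        for lines in ppe.map(partial(extract_range, filename), starts, stops):
            for line in lines:
                print(line)

Call main_chunked(FILENAME) under the same if __name__ == '__main__' guard to try it.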

DarkKnight
  • @Maki Since posting this I've done some more testing and there's a problem. dill can be slow. I have a 472 page PDF. Each call to dumps() is taking 0.44s. Therefore the serialisation is taking longer than the processing time in the subprocess(es). Rewritten to avoid any significant serialisation – DarkKnight Jan 10 '23 at 11:37
  • tried this one but got this error: BrokenProcessPool: A child process terminated abruptly, the process pool is not usable anymore – Maki Jan 11 '23 at 05:40
  • @Maki What environment are you running your code in? Have you used an **exact** copy/paste of this code (apart from the filename, of course)? See: https://stackoverflow.com/questions/15900366/all-example-concurrent-futures-code-is-failing-with-brokenprocesspool – DarkKnight Jan 11 '23 at 07:22
  • using Python 3.10; thanks for the link. I tried the exact sample from the link (copy and paste), ran it, and the same error comes up – Maki Jan 11 '23 at 07:35
  • @Maki I don't know what you're doing wrong. This code on Python 3.11.1 (macOS 13.1) successfully processes a 503-page PDF in 23 seconds – DarkKnight Jan 11 '23 at 07:40