
Team,

I have a PDF file with 6000+ pages. What's the fastest method I can use to extract the text?

I am using this code:

import pdfplumber

all_text = ""
with pdfplumber.open(pdf_dir) as pdf:  # pdf_dir holds the path to the PDF file
    for page in pdf.pages:
        text = page.extract_text()
        all_text += text or ""  # extract_text() may return None on pages with no text

but it's taking a very long time to complete.

Also, after extracting the text, I need to search for the address lines, for which I am using this code:

import re

address_line = re.compile(r'(:  \d{5})')
for line in all_text.split('\n'):
    if address_line.search(line):
        print(line)

appreciate your help in advance :)

Maki

2 Answers


Since you don't need to keep the whole text in memory, just iterate through each page's lines and collect the matching lines:

import re
import pdfplumber

with pdfplumber.open(pdf_dir) as pdf:
    matched_lines = []
    address_line = re.compile(r'(:  \d{5})')
    for page in pdf.pages:
        text = page.extract_text() or ''  # extract_text() may return None on empty pages
        for line in text.split('\n'):
            if address_line.search(line):
                matched_lines.append(line)
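
For a 6000-page file you may also want to stream the matches straight to disk instead of holding them in a list; here is a minimal sketch of that variant (matches.txt is just an assumed output name):

import re
import pdfplumber

address_line = re.compile(r'(:  \d{5})')
with pdfplumber.open(pdf_dir) as pdf, open('matches.txt', 'w') as out:
    for page in pdf.pages:
        text = page.extract_text() or ''  # guard against pages with no extractable text
        for line in text.split('\n'):
            if address_line.search(line):
                out.write(line + '\n')

Note that this only reduces memory use; extract_text() itself is still the slow part.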
RomanPerekhrest
  • this combines the code, thanks, but I'm still having issues with the time to complete the task; I've been at it for 20+ minutes and am still waiting – Maki Jan 10 '23 at 10:07

You may find multiprocessing more efficient. Here's an example of how that could be done:

import pdfplumber
from re import compile
from sys import stderr
from concurrent.futures import ProcessPoolExecutor as PPE
from functools import partial

FILENAME = 'Maki.pdf'
PATTERN = compile(r'(:  \d{5})')

# Return a list of all lines on one page that match the regular expression.
# Each worker process re-opens the PDF itself, so no pdfplumber objects
# have to be serialised and sent between processes.
def extract(filename, page):
    result = []
    try:
        with pdfplumber.open(filename) as pdf:
            for line in pdf.pages[page].extract_text().split('\n'):
                if PATTERN.search(line):
                    result.append(line)
    except Exception as e:
        print(e, file=stderr)
    return result

def main(filename):
    # The parent opens the PDF only to discover the page count; the workers
    # receive just the filename and a page number.
    with PPE() as ppe, pdfplumber.open(filename) as pdf:
        for lines in ppe.map(partial(extract, filename), range(len(pdf.pages))):
            print(lines)

if __name__ == '__main__':
    main(FILENAME)

Notes:

Rewritten to avoid serialisation: each worker receives only the filename and a page index and re-opens the PDF itself, so no pdfplumber objects need to be pickled and shipped between processes (see the comment below on how costly that can be). If re-opening the file once per page proves expensive, a chunked variant is sketched below.
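
A rough, untested sketch of such a chunked variant, reusing the imports, FILENAME and PATTERN from the code above (extract_range, main_chunked and CHUNK = 50 are illustrative names and an arbitrary chunk size):

CHUNK = 50  # pages per worker task; an arbitrary tuning value

# Each worker opens the PDF once and scans a contiguous range of pages,
# so the cost of re-opening the file is amortised over the whole chunk.
def extract_range(filename, start, stop):
    result = []
    try:
        with pdfplumber.open(filename) as pdf:
            for page in pdf.pages[start:stop]:
                text = page.extract_text() or ''
                for line in text.split('\n'):
                    if PATTERN.search(line):
                        result.append(line)
    except Exception as e:
        print(e, file=stderr)
    return result

def main_chunked(filename):
    # Open the PDF in the parent only to learn the page count.
    with pdfplumber.open(filename) as pdf:
        n = len(pdf.pages)
    starts = range(0, n, CHUNK)
    stops = [min(s + CHUNK, n) for s in starts]
    with PPE() as ppe:
        for lines in ppe.map(partial(extract_range, filename), starts, stops):
            for line in lines:
                print(line)

Call main_chunked(FILENAME) under the same if __name__ == '__main__' guard to try it.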

DarkKnight
  • @Maki Since posting this I've done some more testing and there's a problem. dill can be slow. I have a 472 page PDF. Each call to dumps() is taking 0.44s. Therefore the serialisation is taking longer than the processing time in the subprocess(es). Rewritten to avoid any significant serialisation – DarkKnight Jan 10 '23 at 11:37
  • tried this one but got this error: BrokenProcessPool: A child process terminated abruptly, the process pool is not usable anymore – Maki Jan 11 '23 at 05:40
  • @Maki What environment are you running your code in? Have you used an **exact** copy/paste of this code (apart from the filename, of course)? See: https://stackoverflow.com/questions/15900366/all-example-concurrent-futures-code-is-failing-with-brokenprocesspool – DarkKnight Jan 11 '23 at 07:22
  • using Python 3.10; thanks for the link. I tried the exact sample from the link (copy and paste), ran it, and the same error comes up – Maki Jan 11 '23 at 07:35
  • @Maki I don't know what you're doing wrong. This code on Python 3.11.1 (macOS 13.1) successfully processes a 503-page PDF in 23 seconds – DarkKnight Jan 11 '23 at 07:40