
I'm struggling to read my PDF files asynchronously. I tried aiofiles (open source, on GitHub). I want to extract the text from PDFs, and I want to do it with pdfminer because pypdf does not currently render math (Greek letters) or ligatures (e.g. ff) properly.

The routine that works is:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

with open(pdf_filename, 'rb') as file:

    resource_manager = PDFResourceManager(caching=False)

    # Create a string buffer object for text extraction
    text_io = StringIO()

    # Create a text converter object
    text_converter = TextConverter(resource_manager, text_io, laparams=LAParams())

    # Create a PDF page interpreter object
    page_interpreter = PDFPageInterpreter(resource_manager, text_converter)

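    # Process each page in the PDF file.
    # "async for" is only valid inside a coroutine, so this block lives in an
    # "async def"; extract_pages is the async generator defined further down.
    async for page in extract_pages(file):
        page_interpreter.process_page(page)

    text = text_io.getvalue()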

But if I replace `with open(pdf_filename, 'rb') as file` with `async with aiofiles.open(pdf_filename, 'rb') as file`, the line `async for page in extract_pages(file)` fails with this error:

async for page in extract_pages(file): TypeError: 'async for' requires an object with __aiter__ method, got generator

So how do I get the file returned by aiofiles to behave like a normal file that works with `__aiter__`?

This is what I use in place of the original extract_pages function to try to make it work asynchronously:

async def extract_pages(file):
    with file:
        for page in PDFPage.get_pages(file, caching=False):
            yield page

Many thanks if you can help me read a PDF file asynchronously in Python with pdfminer, or with something equivalent that can read math.

Quentin
  • The `async def extract_pages(...` variant should work. Is the error message shown with it? Which Python version do you use? – Michael Butscher Apr 26 '23 at 09:05
  • The async def extract_pages works but only if I open the pdf with open. But if I want to open the pdf asynchronously using async with aiofiles.open(pdf_filename, 'rb') instead of the usual open function then I get the error displayed. I need to open the file asynchronously (in addition to running extract_pages async) because otherwise it blocks my websockets. – Quentin Apr 26 '23 at 14:04
  • As pdfminer doesn't seem to support async execution, you must run `PDFPage.get_pages` in a separate thread using `loop.run_in_executor`. If `get_pages` returns an iterator (instead of a list or tuple), it may be necessary to replace the for-loop with a while-loop and calls to `next` to iterate over the synchronous iterator. Then the synchronous file object won't disturb the websockets anymore because all time consuming operations happen in another thread. – Michael Butscher Apr 26 '23 at 14:38
  • Thanks a lot for your answer. Could you just show me what the code would look like because I'm unsure? – Quentin Apr 26 '23 at 20:37

1 Answer

PDFPage.get_pages is really a generator, so it must be wrapped in an asynchronous generator. I haven't found a ready-made solution to do this, so here is my own:

import asyncio


class WrappedStopIteration(Exception):
    """ "StopIteration" can't be transferred through a Future, so we need our own replacement"""
    pass


def nextwrap(it):
    try:
        return next(it)
    except StopIteration as e:
        raise WrappedStopIteration(e)


async def agen(it):
    loop = asyncio.get_running_loop()
    try:
        while True:
            v = await loop.run_in_executor(None, nextwrap, it)
            yield v
    except WrappedStopIteration:
        pass

(Caveat: Fails if thread-local variables are used or the generator/iterator otherwise assumes that it is executed completely in the same thread.)

In your case it can be used as follows:

async def extract_pages(file):
    
    # "with file:" can be omitted because there is already the outer "with"
    # enclosing the whole execution

    async for page in agen(PDFPage.get_pages(file, caching=False)):
        yield page
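
A variant of the same idea, as a minimal sketch: instead of a custom exception, pass a sentinel as the default argument of `next`, so the worker thread never raises `StopIteration`. The names `agen_sentinel`, `slow_pages` and `heartbeat` are made up for this illustration; `slow_pages` only stands in for `PDFPage.get_pages`, and `heartbeat` stands in for the websocket traffic. The same thread-related caveat applies.

```python
import asyncio
import time

_DONE = object()  # unique sentinel marking the end of iteration


async def agen_sentinel(it):
    """Yield items from a blocking iterator, fetching each one in a worker thread."""
    loop = asyncio.get_running_loop()
    while True:
        # next(it, _DONE) returns _DONE instead of raising StopIteration
        item = await loop.run_in_executor(None, next, it, _DONE)
        if item is _DONE:
            return
        yield item


def slow_pages(n):
    """Hypothetical stand-in for PDFPage.get_pages: blocks briefly per item."""
    for i in range(n):
        time.sleep(0.05)
        yield f"page-{i}"


async def heartbeat(ticks):
    """Stand-in for other event-loop work: records a tick every 10 ms."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def main():
    ticks = []
    beat = asyncio.create_task(heartbeat(ticks))
    pages = [page async for page in agen_sentinel(slow_pages(3))]
    beat.cancel()
    try:
        await beat
    except asyncio.CancelledError:
        pass
    return pages, len(ticks)


if __name__ == "__main__":
    pages, tick_count = asyncio.run(main())
    print(pages, tick_count)
```

If the heartbeat keeps ticking while the pages are being fetched, the blocking `next` calls are not holding up the event loop.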
Michael Butscher
  • It works when I use with open(pdf_filename, 'rb') as file: but if I use async with aiofiles.open(pdf_filename, 'rb') as file: (which is the whole point of my message because the previous code I gave above was already working with the usual open) then I get a long error message finishing with: File "/usr/local/lib/python3.9/site-packages/pdfminer/psparser.py", line 280, in revreadlines while 0 < pos: TypeError: '<' not supported between instances of 'int' and 'generator' – Quentin Apr 27 '23 at 15:16
  • @Quentin This can't work with an asynchronous file object because pdfminer doesn't support such files. But it shouldn't be necessary anyway because the blocking access to the file by pdfminer happens in a separate thread and the websockets in main thread shouldn't be affected. – Michael Butscher Apr 27 '23 at 18:05
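
To illustrate the point about keeping the blocking file access off the main thread, here is a rough sketch that moves the whole extraction, including `process_page`, into a worker thread. With the wrapped generator alone, `process_page` is still called from the coroutine and may hold up the event loop on large pages. This assumes Python 3.9+ for `asyncio.to_thread`; the names `extract_text_blocking`, `extract_text_async` and the `extractor` parameter are hypothetical, while the pdfminer calls are the ones from the question.

```python
import asyncio
from io import StringIO


def extract_text_blocking(pdf_filename):
    """Synchronous pdfminer extraction, using the same calls as the question."""
    # Imported here so the rest of the module loads without pdfminer installed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    resource_manager = PDFResourceManager(caching=False)
    text_io = StringIO()
    text_converter = TextConverter(resource_manager, text_io, laparams=LAParams())
    page_interpreter = PDFPageInterpreter(resource_manager, text_converter)

    # A plain open() is fine: this whole function runs off the event loop.
    with open(pdf_filename, "rb") as file:
        for page in PDFPage.get_pages(file, caching=False):
            page_interpreter.process_page(page)

    return text_io.getvalue()


async def extract_text_async(pdf_filename, extractor=extract_text_blocking):
    """Run the blocking extractor in a worker thread and await its result."""
    # asyncio.to_thread needs Python 3.9+; "extractor" is a hypothetical hook
    # that lets a different blocking function be swapped in.
    return await asyncio.to_thread(extractor, pdf_filename)
```

From a coroutine this would be called as `text = await extract_text_async(pdf_filename)`; aiofiles is not needed because nothing in the event loop thread touches the file.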