I'm struggling to read my PDF files asynchronously. I tried aiofiles, which is open source on GitHub. I want to extract the text from the PDFs, and I want to do it with pdfminer because pypdf currently does not render math (Greek letters) or double letters (e.g. ff) properly.
The routine that works is:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
with open(pdf_filename, 'rb') as file:
    resource_manager = PDFResourceManager(caching=False)
    # Create a string buffer object for text extraction
    text_io = StringIO()
    # Create a text converter object
    text_converter = TextConverter(resource_manager, text_io, laparams=LAParams())
    # Create a PDF page interpreter object
    page_interpreter = PDFPageInterpreter(resource_manager, text_converter)
    # Process each page in the PDF file
    async for page in extract_pages(file):
        page_interpreter.process_page(page)
    text = text_io.getvalue()
But when I replace the line
    with open(pdf_filename, 'rb') as file:
with
    async with aiofiles.open(pdf_filename, 'rb') as file:
the line
    async for page in extract_pages(file):
fails and I get this error:
    async for page in extract_pages(file):
TypeError: 'async for' requires an object with __aiter__ method, got generator
So how do I get the file returned by aiofiles to work like a normal file, with __aiter__?
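To make the error message concrete, here is a minimal sketch using only the standard library. It suggests the TypeError quoted above concerns the object being iterated rather than the file: a plain generator has no __aiter__ method, while an async generator does. The names sync_pages, async_pages, and collect are made up for illustration and the "pages" are just strings.

```python
import asyncio


def sync_pages():
    # Plain generator: usable with `for`, has __iter__ but no __aiter__.
    yield from ("page 1", "page 2")


async def async_pages():
    # Async generator (async def + yield): usable with `async for`.
    for page in sync_pages():
        yield page


async def collect():
    # Writing `async for page in sync_pages()` here would raise the
    # TypeError quoted above, because a plain generator lacks __aiter__.
    return [page async for page in async_pages()]


has_aiter_sync = hasattr(sync_pages(), "__aiter__")
has_aiter_async = hasattr(async_pages(), "__aiter__")
pages = asyncio.run(collect())
```

Note that wrapping a synchronous loop in an async generator like this only changes the iteration protocol; the work inside the loop still runs synchronously on the event loop thread.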
This is the function I use in place of the original extract_pages to try to make it work asynchronously:
async def extract_pages(file):
    with file:
        for page in PDFPage.get_pages(file, caching=False):
            yield page
Many thanks if you can help me read a PDF file asynchronously in Python, with pdfminer or something equivalent that can read math.
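For reference, one direction that might work is sketched below. pdfminer's parser is synchronous and expects a regular file-like object, so the idea is to read the raw bytes off the event loop, wrap them in BytesIO, and run the parsing in a worker thread. This is a sketch under assumptions, not a tested solution to the problem above: read_pdf_text and parse_pdf_bytes are hypothetical helper names, it assumes pdfminer.six (which provides pdfminer.high_level.extract_text), and asyncio.to_thread needs Python 3.9 or later.

```python
import asyncio
from io import BytesIO
from pathlib import Path
from typing import Callable


def parse_pdf_bytes(data: bytes) -> str:
    # Assumption: pdfminer.six is installed. Imported lazily so the rest
    # of the sketch can be loaded without it.
    from pdfminer.high_level import extract_text

    # BytesIO gives pdfminer the ordinary synchronous file object it expects.
    return extract_text(BytesIO(data))


async def read_pdf_text(
    path: str, parser: Callable[[bytes], str] = parse_pdf_bytes
) -> str:
    # Read the file in a worker thread so the event loop is not blocked.
    data = await asyncio.to_thread(Path(path).read_bytes)
    # Parsing is CPU-bound and synchronous, so it also runs in a thread.
    return await asyncio.to_thread(parser, data)
```

Usage would look like text = await read_pdf_text(pdf_filename) inside a coroutine. The read step could instead use aiofiles (async with aiofiles.open(path, 'rb') as f: data = await f.read()), and the parsing step would still need to run in a thread or process because pdfminer itself is not asynchronous.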