Extracting text from two column pdf using python

Question

I am trying to extract text from a two-column PDF. Both PyPDF2 and pdfplumber read the page horizontally: after the first line of column one, they go on to the first line of column two instead of the second line of column one. I have also tried this code (githubusercontent) as is, with the same result. I also saw "How to extract text from two column pdf with Python?", but I don't want to convert the pages to images because I have thousands of pages. Any help will be appreciated. Thanks!

score 1 · Answer 1 · answered Apr 05 '22 at 03:33

You can check the blog post below, which uses PyMuPDF to extract text from two-column PDFs such as research papers.

https://towardsdatascience.com/read-a-multi-column-pdf-using-pymupdf-in-python-4b48972f82dc

From what I have tested so far, it works quite well. I would highly recommend the "blocks" option.

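import fitz  # PyMuPDF

# Read the text layer block by block (this is text extraction, not OCR)
with fitz.open(DIGITIZED_FILE_PATH) as doc:  # path to a searchable PDF
    for page in doc:
        # Each block is a tuple: (x0, y0, x1, y1, text, block_no, block_type)
        blocks = page.get_text("blocks")
        print(blocks)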

Note: This does not work for scanned images. It only works for searchable PDF files, i.e. those that contain a text layer.
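
If the blocks still come back interleaved across columns, one option is to reorder them yourself using the coordinates that the "blocks" option returns. The following is only a rough sketch, not the method from the blog post: it assumes exactly two columns with the gutter at the middle of the page, and the names order_two_column_blocks and extract_two_column_text are made up for illustration. A block that spans the full page width (a title, for example) will be assigned to one of the columns, so such pages may need special handling.

```python
from typing import List, Sequence, Tuple

# Shape of one entry from PyMuPDF's page.get_text("blocks"):
# (x0, y0, x1, y1, text, block_no, block_type)
Block = Tuple[float, float, float, float, str, int, int]


def order_two_column_blocks(blocks: Sequence[Block], page_width: float) -> List[Block]:
    """Return text blocks in reading order: left column top-down, then right."""
    middle = page_width / 2  # assumption: the column gutter sits at mid-page
    text_blocks = [b for b in blocks if b[6] == 0]  # block_type 0 = text, 1 = image
    # Assign each block to a column by the horizontal centre of its bounding box.
    left = [b for b in text_blocks if (b[0] + b[2]) / 2 < middle]
    right = [b for b in text_blocks if (b[0] + b[2]) / 2 >= middle]
    by_top = lambda b: (b[1], b[0])  # sort top to bottom, then left to right
    return sorted(left, key=by_top) + sorted(right, key=by_top)


def extract_two_column_text(pdf_path: str) -> str:
    """Hypothetical wrapper: apply the ordering to every page of a searchable PDF."""
    import fitz  # PyMuPDF; imported here so the ordering helper has no dependency

    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            ordered = order_two_column_blocks(page.get_text("blocks"), page.rect.width)
            pages.append("\n".join(b[4].strip() for b in ordered))
    return "\n\n".join(pages)
```

Calling extract_two_column_text on a searchable PDF path should then give the left column of each page followed by the right column. Since no images are rendered, this stays practical for thousands of pages.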
