I have a PDF containing about 5000 lines of Fortran code in strict fixed-form format: for example, statements start in column 7 and column 6 is reserved for line continuation. When I extract the text from the PDF with an online tool, every line in the output starts at column 1, so the column layout is lost. I am hoping a Python library such as pdfminer can help.

I found similar code in the following question, but no text is printed and I am not sure what is wrong. How can I save the text to a CSV file or a Fortran .for file? Thanks

Python module for converting PDF to text

import io

from pdfminer3.converter import TextConverter
from pdfminer3.layout import LAParams
from pdfminer3.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer3.pdfpage import PDFPage

# Collect the extracted text in an in-memory buffer.
resource_manager = PDFResourceManager()
fake_file_handle = io.StringIO()
converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams())
page_interpreter = PDFPageInterpreter(resource_manager, converter)

with open('gzw_umat.pdf', 'rb') as fh:
    # Feed every page through the interpreter; the converter writes into the buffer.
    for page in PDFPage.get_pages(fh,
                                  caching=True,
                                  check_extractable=True):
        page_interpreter.process_page(page)

    text = fake_file_handle.getvalue()

# close open handles
converter.close()
fake_file_handle.close()

print(text)
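On the "how do I save it" part, a minimal sketch, assuming `text` holds whatever the extraction produced. The helper name `save_fortran_source` and the output file name are made up for illustration. It writes the text to a `.for` file and returns the number of non-blank lines, so an empty extraction is easy to spot.

```python
from pathlib import Path


def save_fortran_source(text: str, out_path: str) -> int:
    """Write extracted text to out_path and return the number of non-blank lines."""
    # Drop trailing whitespace only; leading spaces matter in fixed-form Fortran.
    lines = [line.rstrip() for line in text.splitlines()]
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return sum(1 for line in lines if line.strip())


# Hypothetical usage with the `text` variable from the snippet above:
# count = save_fortran_source(text, "gzw_umat.for")
# if count == 0:
#     print("Nothing extracted; the PDF may have no text layer.")
```

A return value of 0 would be consistent with a scanned PDF that has no text layer, in which case OCR is probably the only route.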

roudan
  • All the packages you will find have their own limitations, and in the worst case you will need OCR (optical character recognition). Since you are not seeing any output, your PDF is most likely a scanned copy of the code. – simpleApp Apr 28 '21 at 03:09
  • There is no formatting in a PDF: all text is placed at hard-coded positions on the page, so you will not find any leading spaces or tabs, and the order of the text inside the PDF is not guaranteed. – rioV8 Apr 28 '21 at 03:15
  • Is there any other solution, then? Thanks – roudan Apr 28 '21 at 03:24
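If the PDF does have a text layer, the point about hard-coded positions suggests one possible workaround: read the x-coordinate of each text line and convert it back to a column number. Below is a rough sketch of that idea, not a tested solution for this particular file. The names `rebuild_columns` and `pages_from_pdf` are made up, and `left_margin` and `char_width` (in PDF points) are assumptions that would have to be measured from the actual PDF. It also assumes a monospaced font and treats each pdfminer text line as a single fragment, so wide gaps inside a line may not be preserved.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Fragment = Tuple[float, float, str]  # (x0, y0, text), in PDF points


def rebuild_columns(fragments: Iterable[Fragment],
                    left_margin: float,
                    char_width: float) -> List[str]:
    """Turn positioned fragments from one page into lines with leading spaces restored."""
    # Group fragments that sit on (roughly) the same baseline.
    rows: Dict[int, List[Tuple[float, str]]] = {}
    for x0, y0, text in fragments:
        rows.setdefault(round(y0), []).append((x0, text))

    lines = []
    # PDF y-coordinates grow upward, so sort descending to read top to bottom.
    for y in sorted(rows, reverse=True):
        line = ""
        for x0, text in sorted(rows[y]):
            # Zero-based column implied by the horizontal offset.
            col = max(0, round((x0 - left_margin) / char_width))
            line = line.ljust(col) + text
        lines.append(line.rstrip())
    return lines


def pages_from_pdf(path: str) -> Iterator[List[Fragment]]:
    """Yield one list of (x0, y0, text) fragments per page, using pdfminer3 layout objects."""
    # Imported here so rebuild_columns works even without pdfminer3 installed.
    from pdfminer3.converter import PDFPageAggregator
    from pdfminer3.layout import LAParams, LTTextBox, LTTextLine
    from pdfminer3.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer3.pdfpage import PDFPage

    resource_manager = PDFResourceManager()
    device = PDFPageAggregator(resource_manager, laparams=LAParams())
    interpreter = PDFPageInterpreter(resource_manager, device)
    with open(path, "rb") as fh:
        for page in PDFPage.get_pages(fh):
            interpreter.process_page(page)
            fragments = []
            for box in device.get_result():
                if isinstance(box, LTTextBox):
                    for line in box:
                        if isinstance(line, LTTextLine):
                            fragments.append((line.x0, line.y0, line.get_text().strip()))
            yield fragments


# Hypothetical usage; 72.0 and 6.0 are placeholder values, not measured ones:
# with open("gzw_umat.for", "w") as out:
#     for page in pages_from_pdf("gzw_umat.pdf"):
#         out.write("\n".join(rebuild_columns(page, 72.0, 6.0)) + "\n")
```

Fragments are grouped per page because y-coordinates restart on every page. For a scanned PDF this yields nothing, and OCR would still be needed first.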

0 Answers