I have a PDF of 5000 lines of Fortran code written in strict fixed-form format: for example, statements start in column 7 and column 6 is reserved for line continuation. When I extract the text from the PDF with an online tool, every line of the output starts at column 1. I am hoping a Python library such as pdfminer can help.
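To make the target layout concrete, here is a minimal sketch of the standard fixed-form column rule (columns 1-5 for a statement label, column 6 for the continuation mark, statement text from column 7). The helper name `fixed_form_line` and its parameters are hypothetical, introduced only for illustration; it does not try to guess which extracted lines are continuations.

```python
def fixed_form_line(statement: str, label: str = "", continuation: bool = False) -> str:
    """Lay out one fixed-form Fortran line.

    Columns 1-5 hold an optional statement label, column 6 holds the
    continuation mark, and the statement text starts in column 7.
    """
    if len(label) > 5:
        raise ValueError("label must fit in columns 1-5")
    # Any non-blank, non-zero character in column 6 marks a continuation;
    # '&' is used here as an assumed convention.
    mark = "&" if continuation else " "
    # Left-justify the label in a 5-character field, then add column 6.
    return f"{label:<5}{mark}{statement.strip()}"


if __name__ == "__main__":
    print(fixed_form_line("DO 10 I = 1, N"))
    print(fixed_form_line("CONTINUE", label="10"))
```

Lines built this way keep their leading blanks, which is exactly what is lost when the extracted text starts at column 1.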
I found similar code here, but no text is printed and I am not sure what is wrong. Also, how can I save the text to a CSV or Fortran .for file? Thanks.
Python module for converting PDF to text
from pdfminer3.layout import LAParams, LTTextBox
from pdfminer3.pdfpage import PDFPage
from pdfminer3.pdfinterp import PDFResourceManager
from pdfminer3.pdfinterp import PDFPageInterpreter
from pdfminer3.converter import PDFPageAggregator
from pdfminer3.converter import TextConverter
import io
resource_manager = PDFResourceManager()
fake_file_handle = io.StringIO()
converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams())
page_interpreter = PDFPageInterpreter(resource_manager, converter)
with open('gzw_umat.pdf', 'rb') as fh:
    for page in PDFPage.get_pages(fh,
                                  caching=True,
                                  check_extractable=True):
        page_interpreter.process_page(page)
    text = fake_file_handle.getvalue()
# close open handles
converter.close()
fake_file_handle.close()
print(text)
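For the saving part of the question, a minimal sketch using only the standard library might look like the following. It assumes `text` is the string produced by the extraction above; the function name `save_fortran_source` and the output file name are hypothetical. It only writes the text out with leading blanks preserved, so it will not restore column positions that the extraction has already lost.

```python
from pathlib import Path


def save_fortran_source(text: str, path: str) -> int:
    """Write extracted text to a file and return the number of lines written."""
    # splitlines() also breaks on form feeds, which PDF text extraction
    # often emits between pages.
    lines = text.splitlines()
    # Strip trailing whitespace only; leading blanks carry the column layout.
    cleaned = [line.rstrip() for line in lines]
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)


if __name__ == "__main__":
    sample = "      X = 1   \n     &  + 2\n"
    print(save_fortran_source(sample, "gzw_umat.for"))
```

If the printed `text` is empty, the saved file will be empty too; one possible cause (an assumption, not verified for this PDF) is that the pages are scanned images with no text layer.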