I am using this code to extract text from a PDF:
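from io import BytesIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def pdf_to_txt(path):
    manager = PDFResourceManager()
    retstr = BytesIO()  # buffer that receives the extracted text
    layout = LAParams(all_texts=True)
    device = TextConverter(manager, retstr, laparams=layout)
    interpreter = PDFPageInterpreter(manager, device)
    # Feed every extractable page through the interpreter
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, check_extractable=True):
            interpreter.process_page(page)
    text = retstr.getvalue()
    device.close()
    retstr.close()
    return text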
In my PDF file, some lines contain what I believe is a TAB separator (the two parts sit in the same column cell and the gap between them is wider than a single space). Example:
Hello this is
PDFMiner converts this line to:
Hello
this is
Expected output:
Hello this is
Does anyone know how to make PDFMiner treat this gap as an additional in-line separator, so that it does not start a new line?
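One setting that may be relevant is the `char_margin` parameter of `LAParams`. As far as I understand the documentation, it controls how far apart two characters can be (relative to the character width) and still count as part of the same line. The sketch below is only a toy model of that idea, not PDFMiner's actual code. The function `group_into_lines` and the chunk coordinates are hypothetical and made up for illustration.

```python
# Toy model of gap-based line grouping; NOT pdfminer's actual implementation.
# The chunk positions and the character width are made-up illustration values.

def group_into_lines(chunks, char_width, char_margin):
    """Join (x0, x1, text) chunks from one visual row into output lines.

    A chunk stays on the current line when the gap to the previous chunk
    is smaller than char_margin * char_width; otherwise a new line starts.
    """
    lines = []
    prev_x1 = None
    for x0, x1, text in sorted(chunks):
        if prev_x1 is not None and x0 - prev_x1 < char_margin * char_width:
            lines[-1] += " " + text
        else:
            lines.append(text)
        prev_x1 = x1
    return lines


# Hypothetical layout: "Hello" ends at x=30 and "this is" starts at x=60.
chunks = [(0, 30, "Hello"), (60, 100, "this is")]

# A small margin splits the row; a larger margin keeps it together.
narrow = group_into_lines(chunks, char_width=6, char_margin=2.0)
wide = group_into_lines(chunks, char_width=6, char_margin=10.0)
print(narrow)
print(wide)
```

If this is what is happening in my file, passing a larger value, for example `LAParams(all_texts=True, char_margin=10.0)`, might keep both parts on one line. I have not verified this on my PDF, and the value 10.0 is a guess.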
Thanks!