I am using this code to extract text from a PDF:

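from io import BytesIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def pdf_to_txt(path):
    manager = PDFResourceManager()
    retstr = BytesIO()  # binary buffer, so the function returns bytes
    layout = LAParams(all_texts=True)
    device = TextConverter(manager, retstr, laparams=layout)
    interpreter = PDFPageInterpreter(manager, device)
    # Open the PDF in binary mode; the with block closes it even on errors.
    with open(path, 'rb') as pdf_file:
        for page in PDFPage.get_pages(pdf_file, check_extractable=True):
            interpreter.process_page(page)
    text = retstr.getvalue()
    device.close()
    retstr.close()
    return text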

In my PDF file, one line contains what looks like a TAB separator. I believe it is a TAB because the two parts sit in the same column cell and the gap between them is wider than a single space. Example: Hello this is

PDFMiner converts this line to:

    Hello
    this is

Expected output:

Hello this is

Does anyone know how to configure an additional separator in PDFMiner so that it does not split such a line into two?

Thanks!

sygneto

1 Answer

This turned out to be a problem inside one of the PDFs. To solve it, I used the coordinates of the text lines and compared them. More information can be found here: How to extract text and text coordinates from a PDF file?
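For anyone landing here, this is a rough sketch of what comparing line coordinates could look like. It is not the exact code I used. It assumes pdfminer.six (for `extract_pages`, `LTTextContainer` and `LTTextLine`); the helper names `merge_same_row` and `pdf_rows` and the `y_tolerance` value of 2 points are made up for illustration and will probably need tuning for your PDF.

```python
from typing import Iterable, List, Tuple

# (x0, y0, text): left edge, bottom edge and content of one text line
Fragment = Tuple[float, float, str]


def merge_same_row(fragments: Iterable[Fragment], y_tolerance: float = 2.0) -> List[str]:
    """Join fragments whose bottom edges lie within y_tolerance of the row's first fragment."""
    rows: List[Tuple[float, List[Fragment]]] = []
    # PDF y coordinates grow upwards, so sort by descending y to read top to bottom.
    for frag in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - frag[1]) <= y_tolerance:
            rows[-1][1].append(frag)  # same visual row as the previous fragment
        else:
            rows.append((frag[1], [frag]))  # start a new row
    # Within a row, order fragments left to right and join them with one space.
    return [
        " ".join(f[2].strip() for f in sorted(row, key=lambda f: f[0]))
        for _, row in rows
    ]


def pdf_rows(path: str, y_tolerance: float = 2.0) -> List[str]:
    """Collect text lines with pdfminer.six and merge them page by page."""
    # Imported here so merge_same_row can be used without pdfminer installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine

    result: List[str] = []
    for page in extract_pages(path):
        fragments = [
            (line.x0, line.y0, line.get_text())
            for box in page if isinstance(box, LTTextContainer)
            for line in box if isinstance(line, LTTextLine)
        ]
        result.extend(merge_same_row(fragments, y_tolerance))
    return result
```

With made-up coordinates, `merge_same_row([(120.0, 700.0, "this is\n"), (50.0, 700.5, "Hello\n")])` gives `["Hello this is"]`, because both fragments sit at almost the same height on the page.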

sygneto