PDFMiner: bad newline detection

Question

I am using this code to extract text from a PDF:

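from io import BytesIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def pdf_to_txt(path):
    manager = PDFResourceManager()
    retstr = BytesIO()  # in-memory buffer that receives the extracted text
    layout = LAParams(all_texts=True)
    device = TextConverter(manager, retstr, laparams=layout)
    interpreter = PDFPageInterpreter(manager, device)
    # The with-block closes the file even if a page fails to parse
    with open(path, 'rb') as filepath:
        for page in PDFPage.get_pages(filepath, check_extractable=True):
            interpreter.process_page(page)
    text = retstr.getvalue()
    device.close()
    retstr.close()
    return text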

In my PDF file, one line contains what I believe is a TAB as a separator (I think it is a TAB because the two words are in the same table cell and the gap between them is wider than a single space). Example: Hello this is

PDFMiner converts this line to:

    Hello
    this is

Expected output:

Hello this is

Does anyone know how to configure an additional separator in PDFMiner so that it does not create these new lines?

Thanks!

I'd say that that's a bug, so file a bug report. If you really want to fix this yourself, you'd have to provide a [mcve]. — Ulrich Eckhardt, Aug 26 '19 at 08:21

score 0 · Accepted Answer · answered Aug 28 '19 at 10:04


That was a bug in one of the PDFs. To solve it, I used the coordinates of the lines and compared them. You can find more information here: How to extract text and text coordinates from a PDF file?
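For anyone who wants a starting point, here is a rough sketch of the coordinate-comparison idea. It is not the exact code from the linked question: it assumes pdfminer.six (for `extract_pages`), and the helper names `merge_fragments_by_row`, `extract_fragments_per_page`, and `pdf_to_lines`, as well as the `y_tolerance` value of 2 points, are my own hypothetical choices. Text lines whose bottom edges are almost at the same height are treated as one visual row and joined with a space.

```python
def merge_fragments_by_row(fragments, y_tolerance=2.0):
    """Join (x0, y0, text) fragments that sit on the same visual row.

    y_tolerance is an assumed threshold in PDF points; tune it per document.
    """
    rows = []  # each entry is [row_y, [(x0, text), ...]]
    # PDF y-coordinates grow upwards, so sort descending to read top to bottom.
    for x0, y0, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y0) <= y_tolerance:
            rows[-1][1].append((x0, text))
        else:
            rows.append([y0, [(x0, text)]])
    # Within a row, order the fragments left to right before joining them.
    return [
        " ".join(text for _, text in sorted(cells, key=lambda c: c[0]))
        for _, cells in rows
    ]


def extract_fragments_per_page(path):
    """Yield one list of (x0, y0, text) tuples per page."""
    # Imported lazily so the merging helper works without pdfminer installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine

    for page_layout in extract_pages(path):
        fragments = []
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if isinstance(line, LTTextLine):
                    text = line.get_text().strip()
                    if text:
                        fragments.append((line.x0, line.y0, text))
        yield fragments


def pdf_to_lines(path):
    """Return the merged text lines of the whole document."""
    lines = []
    for fragments in extract_fragments_per_page(path):
        lines.extend(merge_fragments_by_row(fragments))
    return lines
```

With this, two fragments such as "Hello" and "this is" that PDFMiner reports as separate lines at nearly the same height come back as a single "Hello this is". If the tolerance is too large, genuinely separate rows will be merged, so it is worth checking against a few pages of the real file.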


sygneto
