Text extraction from PDF results in one long string (python)

Asked Nov 16 '18 at 06:05

I currently have the following function:

import PyPDF2

def readFile(fileName):
    text = ""

    # PDFs are binary files, so open in 'rb' mode rather than text mode
    with open(fileName, 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

        # Concatenate the extracted text of every page
        for count in range(pdfReader.numPages):
            pageObj = pdfReader.getPage(count)
            text += pageObj.extractText()

    return text

But for most PDFs that I try this on, it returns one long string without any spaces between words or sentences. Am I doing something wrong, or is there a way to split up the words?

asked Nov 16 '18 at 06:05

Laurens

Do not assume that you *always* can extract *all* text from *any* PDF. The PDF format is not designed to be able to do that. A common litmus test is to try and copy the text using Acrobat Reader. If that succeeds, it indicates a limitation of your current toolchain. – Jongware Nov 16 '18 at 10:14
Please just update PyPDF2. It received tons of updates in 2022. Additionally, we merged the changes back into pypdf - so you could also just move there. – Martin Thoma Mar 01 '23 at 17:34
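
For reference, a minimal sketch of what moving to pypdf could look like, assuming the pypdf package is installed. The names `join_page_texts` and `read_pdf` are hypothetical helpers made up for this example, not part of the library. Whether the spaces come back still depends on how the PDF itself stores its text.

```python
def join_page_texts(pages, sep="\n"):
    """Join the text of each page, keeping a separator between pages."""
    # Treat a page with no extractable text (e.g. image-only) as empty
    return sep.join((page.extract_text() or "") for page in pages)


def read_pdf(file_name):
    # Imported here so the helper above works without pypdf installed
    from pypdf import PdfReader

    reader = PdfReader(file_name)  # accepts a path or a binary file object
    return join_page_texts(reader.pages)
```

Calling `read_pdf("some.pdf")` (the file name is a placeholder) returns the text of all pages, separated by newlines, so page boundaries do not run together.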

0 Answers