PDF_Doc

I've been using the pdfplumber library to extract text from PDF documents, and it has worked fine. However, for the documents I'm working on now, I only get spaces and lots of (cid:x) tokens instead of text. Is there a solution? Thanks.

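import pdfplumber

# fatura holds the path to the PDF invoice
with pdfplumber.open(fatura) as pdf:
    fatura_individual = ''
    # Concatenate the text of every page
    for pagina in pdf.pages:
        # Guard against pages for which extract_text() yields None
        fatura_individual += pagina.extract_text() or ''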
       
(cid:12)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0),(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0),(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0),(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0),(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0),(cid:0)(cid:0)(cid:0)(cid:0)(cid:0),(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0),(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0),(cid:0)(cid:0)(cid:0)(cid:0)(cid:0),(cid:0)(cid:0)(cid:0)(cid:0)(cid:0),(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0),(cid:0)(cid:0)(cid:0)(cid:0)(cid:0),(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:16)

I just want to extract the full text.
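For context, (cid:N) tokens like the ones above usually mean the extractor could not map the font's character IDs to Unicode, so it prints the raw IDs instead. A minimal sketch for detecting this case, so a script can fall back to another library or to OCR, might look like the following. The names `cid_ratio` and `looks_unmapped` and the 0.5 threshold are illustrative assumptions, not part of pdfplumber.

```python
import re

# Matches tokens such as "(cid:0)" or "(cid:12)" that appear in the output above.
CID_TOKEN = re.compile(r"\(cid:\d+\)")


def cid_ratio(text: str) -> float:
    """Return the fraction of characters in `text` that belong to (cid:N) tokens."""
    if not text:
        return 0.0
    cid_chars = sum(len(match.group(0)) for match in CID_TOKEN.finditer(text))
    return cid_chars / len(text)


def looks_unmapped(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical heuristic: True when most of the text is (cid:N) tokens."""
    return cid_ratio(text) >= threshold
```

Calling `looks_unmapped(fatura_individual)` after the extraction loop would flag documents like this one, while normal text passes through untouched.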

foliveir

1 Answer

Try PyPDF2: https://pypdf2.readthedocs.io/en/latest/user/extract-text.html

from PyPDF2 import PdfReader

reader = PdfReader("example.pdf")
for page in reader.pages:
    print(page.extract_text())
Martin Thoma
  • Hi, thanks. I already tried that library and it works; however, I get a warning: "incorrect startxref pointer(1)". I searched but didn't find any information about it. What is the problem, and what could happen? Thanks – foliveir Nov 20 '22 at 22:03
  • The PDF is broken, but PyPDF2 can deal with it. It just lets you know. https://pypdf2.readthedocs.io/en/latest/user/suppress-warnings.html – Martin Thoma Nov 20 '22 at 22:57
  • So you're saying I can trust the extracted text and just ignore the warning, right? – foliveir Nov 20 '22 at 23:32
  • What do you mean by "trust"? For PDF, no library can guarantee that you get what you expect for every file. But PyPDF2 has recently become pretty good at text extraction – Martin Thoma Nov 20 '22 at 23:43
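
As a rough sketch of the warning suppression discussed in the comments: this assumes a PyPDF2 2.x release that reports recoverable problems through the standard logging module under the "PyPDF2" logger name, as the linked documentation page describes. Older releases may use Python's warnings module instead. The helper name `extract_all_text` is hypothetical.

```python
import logging

# Assumption: PyPDF2 2.x logs recoverable problems (such as a bad startxref
# pointer) under the "PyPDF2" logger. Raising the level hides those messages.
logging.getLogger("PyPDF2").setLevel(logging.ERROR)


def extract_all_text(path):
    """Hypothetical helper: return the text of every page, joined by newlines."""
    from PyPDF2 import PdfReader  # imported here so the logging setup runs first

    reader = PdfReader(path)
    # Fall back to an empty string if a page yields no text.
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

Hiding the message does not repair the file, so it is still worth spot-checking the extracted text against the original document.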