
How can I extract text from a PDF file in Python?

I tried the following:

import sys
import pyPdf

def convertPdf2String(path):
      content = ""
      pdf = pyPdf.PdfFileReader(file(path, "rb"))
      for i in range(0, pdf.getNumPages()):
          content += pdf.getPage(i).extractText() + " \n"
          content = " ".join(content.replace(u"\xa0", u" ").strip().split())
      return content

f = open('a.txt','w+')

f.write(convertPdf2String(sys.argv[1]).encode("ascii","xmlcharrefreplace"))
f.close()

But the result is as follows, rather than readable text:

728;ˇˆ˜ ˚ˇˇ!""˘ˇˆ˙ˆ˝˛˛˛˛ˆ˜ˆ ˆ ˆ˘ˆ˛˙ˆ"ˆ˘"ˆˆˆ˜#$˙ˆ˚ˆ %&ˆ ˘˛ˆ˜'˙˙%˝˛ˆˇ˙ ˜ˆˆ˜'ˆ ˇˆ#$%&('%$&))$$+%#,-.+&&˝())˝)˝+,,-./012)(˝)*˝+,-3˙ˆ/0245)6#57+82,55)6#57+,+2,+ /!#!!&˘˘1"%˘20˛˛3ˆ07%4!˘"6 ˛ˆ ˝ˆ ˆ˘&/&4"9ˆ %6ˇ%4%4&5˘2)˘˘˛%:6(

Martin Atkins
lost
  • A PDF file does not necessarily contain text in a reasonably exportable form, since PDF creation tools can handle text in various ways. There is no guarantee that you can extract it as a whole the way you want. I assume your PDF is one of those files that look nice but do not let you extract the content in a reasonable way. –  Mar 23 '13 at 05:17
  • I think this is a similar issue to the one I had here: [link](http://stackoverflow.com/questions/14474405/indexing-pdf-from-badly-authored-latex-source). If you need the information contained in such a PDF file, your best bet would be to dump TIFF images (e.g. with Ghostscript) and run OCR on them (e.g. with Tesseract). – theta Mar 23 '13 at 10:53
  • pypdf received tons of updates in 2022. The results would be different if you upgrade your pypdf version – Martin Thoma Mar 01 '23 at 17:36
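
Following up on the last comment, here is a minimal sketch of the question's function ported to the current pypdf package (`pip install pypdf`), which uses `PdfReader`, `reader.pages`, and `page.extract_text()` in place of the old pyPdf calls. The helper name `pages_to_text` is made up for illustration. This is only a sketch: a PDF whose fonts carry no usable character mapping may still produce unreadable output.

```python
def pages_to_text(pages):
    """Join the text of page-like objects, normalizing whitespace per page.

    Hypothetical helper: accepts any iterable of objects with extract_text().
    """
    parts = []
    for page in pages:
        # Guard against an empty result, e.g. for image-only pages
        text = page.extract_text() or ""
        # Replace non-breaking spaces, then collapse runs of whitespace
        parts.append(" ".join(text.replace("\xa0", " ").split()))
    return "\n".join(parts)


def convert_pdf_to_string(path):
    # Imported here so pages_to_text() stays usable without pypdf installed
    from pypdf import PdfReader  # third-party package

    reader = PdfReader(path)
    return pages_to_text(reader.pages)
```

Calling `convert_pdf_to_string(sys.argv[1])` and writing the result to a file opened with `encoding="utf-8"` would replace the `encode("ascii", "xmlcharrefreplace")` step from the question.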

1 Answer


If you are running Linux or macOS, you can call the ps2ascii command from your code:

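import subprocess

input_path = "someFile.pdf"
output_path = "out.txt"
# ps2ascii is part of Ghostscript; the list form avoids shell quoting problems
subprocess.check_call(["ps2ascii", input_path, output_path])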
Moj
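
If the text is needed as a Python string rather than in a file, the ps2ascii approach might be wrapped as in the sketch below. It assumes that ps2ascii writes to standard output when no output file is given, and it checks that the tool is on the PATH first, since it is only available where Ghostscript is installed. The function names are hypothetical.

```python
import shutil
import subprocess


def ps2ascii_command(pdf_path, txt_path=None):
    """Build the ps2ascii argument list (hypothetical helper)."""
    cmd = ["ps2ascii", pdf_path]
    if txt_path is not None:
        cmd.append(txt_path)
    return cmd


def pdf_to_text(pdf_path):
    """Run ps2ascii and return its standard output as a string."""
    if shutil.which("ps2ascii") is None:
        raise RuntimeError("ps2ascii not found on PATH (it ships with Ghostscript)")
    # Assumption: with no output file argument, ps2ascii prints to stdout
    result = subprocess.run(
        ps2ascii_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

Passing the arguments as a list means file names containing spaces need no extra quoting.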