How to extract text from a PDF file in Python?

Question

How can I extract text from a PDF file in Python?

I tried the following:

import sys
import pyPdf

def convertPdf2String(path):
    content = ""
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    for i in range(0, pdf.getNumPages()):
        content += pdf.getPage(i).extractText() + " \n"
        content = " ".join(content.replace(u"\xa0", u" ").strip().split())
    return content

f = open('a.txt','w+')

f.write(convertPdf2String(sys.argv[1]).encode("ascii","xmlcharrefreplace"))
f.close()

But the result is as follows, rather than readable text:

728;ˇˆ˜ ˚ˇˇ!""˘ˇˆ˙ˆ˝˛˛˛˛ˆ˜ˆ ˆ ˆ˘ˆ˛˙ˆ"ˆ˘"ˆˆˆ˜#$˙ˆ˚ˆ %&ˆ ˘˛ˆ˜'˙˙%˝˛ˆˇ˙ ˜ˆˆ˜'ˆ ˇˆ#$%&('%$&))$$+%#,-.+&&˝())˝)˝+,,-./012)(˝)*˝+,-3˙ˆ/0245)6#57+82,55)6#57+,+2,+ /!#!!&˘˘1"%˘20˛˛3ˆ07%4!˘"6 ˛ˆ ˝ˆ ˆ˘&/&4"9ˆ %6ˇ%4%4&5˘2)˘˘˛%:6(

A PDF file does not necessarily contain its text in a reasonably exportable form, since a PDF creation tool can handle text in various ways. There is no guarantee that you can extract the text as a whole in the form you want. I assume your PDF is one of those files that look nice but are not built in a way that lets you extract the content reasonably. — , Mar 23 '13 at 05:17
I think this is similar to the issue I had here: [link](http://stackoverflow.com/questions/14474405/indexing-pdf-from-badly-authored-latex-source). If you need the information contained in such a PDF file, your best bet would be to dump TIFF images (e.g. with Ghostscript) and run OCR on them (e.g. with Tesseract). — theta, Mar 23 '13 at 10:53
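
The rasterize-then-OCR route suggested in this comment can be sketched as follows. This is only a minimal sketch, assuming the Ghostscript (`gs`) and Tesseract (`tesseract`) command-line tools are installed and on the PATH. The helper names, the 300 dpi resolution, and the `tiffg4` output device are illustrative choices, not something the comment specifies.

```python
import subprocess
import tempfile
from pathlib import Path


def gs_command(pdf_path, out_pattern, dpi=300):
    """Build a Ghostscript command that renders each page to a TIFF file."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiffg4",              # black-and-white TIFF (assumed good enough for text)
        f"-r{dpi}",                     # rendering resolution in dpi
        f"-sOutputFile={out_pattern}",  # %03d in the pattern becomes the page number
        str(pdf_path),
    ]


def tesseract_command(image_path, out_base):
    """Build a Tesseract command; Tesseract writes its result to <out_base>.txt."""
    return ["tesseract", str(image_path), str(out_base)]


def ocr_pdf(pdf_path):
    """Hypothetical helper: rasterize a PDF and OCR every page, in page order."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        subprocess.run(gs_command(pdf_path, tmp / "page-%03d.tif"), check=True)
        pages = []
        for tif in sorted(tmp.glob("page-*.tif")):
            subprocess.run(tesseract_command(tif, tif.with_suffix("")), check=True)
            pages.append(tif.with_suffix(".txt").read_text(encoding="utf-8"))
        return "\n".join(pages)
```

OCR quality depends heavily on resolution and fonts, so the result should be treated as approximate text rather than an exact copy of the document.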
pypdf received tons of updates in 2022. The results would be different if you upgrade your pypdf version — Martin Thoma, Mar 01 '23 at 17:36
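
Following up on this comment, a minimal sketch of the same extraction with the current `pypdf` package (the successor of `pyPdf`) might look like this. It assumes Python 3 and that `pypdf` is installed (`pip install pypdf`); `pdf_to_text` and `normalize_whitespace` are illustrative names. For PDFs whose text is not stored in an extractable form, it may still return little or nothing useful.

```python
def normalize_whitespace(text):
    """Replace non-breaking spaces and collapse whitespace runs to one space."""
    return " ".join(text.replace("\xa0", " ").split())


def pdf_to_text(path):
    """Return the text of every page, one page per line."""
    # Imported here so normalize_whitespace stays usable without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # extract_text() can yield an empty result for pages without a text layer.
    pages = (page.extract_text() or "" for page in reader.pages)
    return "\n".join(normalize_whitespace(p) for p in pages)
```

Calling `pdf_to_text(sys.argv[1])` and writing the result to a file opened with `encoding="utf-8"` would replace the `encode("ascii", "xmlcharrefreplace")` step from the question.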

score 21 · Accepted Answer · answered Mar 23 '13 at 15:19

If you are running Linux or macOS, you can call the ps2ascii command from your code:

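import subprocess

input_path = "someFile.pdf"
output_path = "out.txt"
# Passing the arguments as a list avoids shell quoting problems with odd file names;
# check=True raises CalledProcessError if ps2ascii fails.
subprocess.run(["ps2ascii", input_path, output_path], check=True)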

answered Mar 23 '13 at 15:19

Moj

@anony try `pdftotext` instead of `ps2ascii` – Moj Nov 15 '13 at 15:05
What if I only need the text temporarily, just for further processing? – lazarus Apr 09 '15 at 10:11
@Moj It prints 0 instead of the text in the file. – Iqbal Oct 29 '15 at 05:14
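
The points raised in these comments can be sketched together. `os.system` returns a status code (0 on success) rather than the command's output, which would explain why printing its result shows 0. Capturing standard output instead gives the text directly, with no intermediate file. This is a minimal sketch, assuming `pdftotext` (from poppler-utils or Xpdf) is installed; `pdftotext_command` and `extract_text` are illustrative names.

```python
import subprocess


def pdftotext_command(pdf_path, layout=True):
    """Build a pdftotext call that writes to standard output ("-")."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # try to keep the page's physical layout
    cmd += ["-enc", "UTF-8", str(pdf_path), "-"]
    return cmd


def extract_text(pdf_path):
    """Return the PDF's text as a string without creating an output file."""
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,  # collect stdout/stderr instead of printing them
        check=True,           # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.decode("utf-8", errors="replace")
```

The returned string can then be processed further in memory, for example split into lines or searched with a regular expression.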
