Read PDF in Python and convert to text in PDF

Question

I have used this code to convert pdf to text.

input1 = '//Home//Sai Krishna Dubagunta.pdf'
output = '//Home//Me.txt'
os.system(("pdftotext %s %s") %( input1, output))

I have created the Home directory and pasted the source file in it.

The output I get is

No .txt file was created. Where is the problem?

check error code 1 http://msdn.microsoft.com/en-us/library/ms681382(v=vs.85).aspx — ashishmaurya, May 23 '14 at 05:03

Martin Thoma · Answer 1 · 2022-12-26T07:46:19.207

There are various Python packages to extract the text from a PDF with Python. You can see a speed/quality benchmark.

As the maintainer of pypdf and PyPDF2 I am biased, but I would recommend pypdf as a starting point. It is pure Python and BSD 3-clause licensed, which should work for most people. pypdf can also do much more with PDF files (e.g. transformations).

If you are comfortable with a C dependency and don't need to modify the PDF, give pypdfium2 a shot. pypdfium2 is really fast and has excellent extraction quality.

I previously recommended Poppler's pdftotext. Don't use that. Its quality is worse than PDFium/PyPDF2.

Tika and PyMuPDF work about as well as PDFium, but they also have non-Python dependencies. PyMuPDF might not work for you due to its commercial license.

I would NOT use pdfminer / pdfminer.six / pdfplumber / pdftotext / borb / PyPDF2 / PyPDF3 / PyPDF4.

pypdf: Pure Python

Installation: pip install pypdf (more instructions)

from pypdf import PdfReader

reader = PdfReader("example.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"
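
Since the question asks for a .txt file on disk, here is a minimal sketch that writes the extracted text out. The helper names `join_page_text` and `pdf_to_txt` are made up for illustration, and the `or ""` guard is a defensive assumption for pages that yield no text.

```python
from pathlib import Path


def join_page_text(pages):
    """Join the text of page-like objects (anything with extract_text())."""
    # "or ''" is a defensive guard for pages that yield nothing
    return "\n".join((page.extract_text() or "") for page in pages)


def pdf_to_txt(pdf_path, txt_path):
    """Extract all text from pdf_path with pypdf and write it to txt_path."""
    # Imported here so join_page_text stays usable without pypdf installed
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    Path(txt_path).write_text(join_page_text(reader.pages), encoding="utf-8")
```

Usage would look like `pdf_to_txt("example.pdf", "example.txt")`.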

PDFium: High quality and very fast, but with C-dependency

Installation: pip install pypdfium2

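import pypdfium2 as pdfium

# "data" may be a file path, bytes, or a byte buffer; a path is used here as an example
data = "example.pdf"

text = ""
pdf = pdfium.PdfDocument(data)
for i in range(len(pdf)):
    page = pdf.get_page(i)
    textpage = page.get_textpage()
    # Named get_text() before pypdfium2 4.0; current releases use get_text_range()
    text += textpage.get_text_range()
    text += "\n"
    # Release the PDFium handles explicitly, text page first
    textpage.close()
    page.close()
pdf.close()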

This is the best answer. FYI, `pdftotext` requires you [first install `poppler`, which is a little painful on Windows](https://stackoverflow.com/questions/45912641/unable-to-install-pdftotext-on-python-3-6-missing-poppler) — smci, Apr 04 '19 at 22:11
@TomásGomezPizarro PyMuPDF is AGPLv3. Simply speaking, this means you are forbidden to use it in a closed source (rsp. not freely licensed) public project. This is legally binding (case law exists). For uses that don't comply with the AGPL, you have to buy a license from Artifex. — mara004, Jun 06 '23 at 13:00

score 4 · Accepted Answer · answered May 23 '14 at 05:06

Your expression

("pdftotext %s %s") %( input1, output)

will translate to

pdftotext //Home//Sai Krishna Dubagunta.pdf //Home//Me.txt

which means that the first parameter passed to pdftotext is //Home//Sai, and the second parameter is Krishna. That obviously won't work.
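
To see the split concretely, here is a small sketch using `shlex.split` from the standard library. It tokenizes the way a POSIX shell would; Windows `cmd.exe` differs in its details, but it also breaks unquoted arguments at spaces.

```python
import shlex

input1 = "//Home//Sai Krishna Dubagunta.pdf"
output = "//Home//Me.txt"

# The command string exactly as the question builds it, without any quoting
command = "pdftotext %s %s" % (input1, output)

# Tokenize it the way a POSIX shell would before starting pdftotext
args = shlex.split(command)
print(args)
# ['pdftotext', '//Home//Sai', 'Krishna', 'Dubagunta.pdf', '//Home//Me.txt']
```

pdftotext receives four arguments instead of two, so the file name is never seen as a whole.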

Enclose the parameters in quotes:

os.system("pdftotext '%s' '%s'" % (input1, output))
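
Single quotes are a POSIX shell convention; `cmd.exe` on Windows does not treat them as quoting characters, so the line above may still fail there (whether the asker is on Windows is an assumption). A way to sidestep quoting entirely is to pass the arguments as a list to `subprocess.run`, which involves no shell. This is a sketch, and the helper names `build_command` and `run_pdftotext` are hypothetical.

```python
import shutil
import subprocess


def build_command(pdf_path, txt_path):
    """Return the argument list; each item reaches pdftotext as one argument."""
    return ["pdftotext", pdf_path, txt_path]


def run_pdftotext(pdf_path, txt_path):
    """Run pdftotext and return (exit code, stderr text)."""
    # Fail early with a readable message if the executable is missing
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext is not on PATH (it is provided by e.g. Poppler)")
    result = subprocess.run(
        build_command(pdf_path, txt_path),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stderr
```

Calling `run_pdftotext(input1, output)` then reports the exit code together with whatever pdftotext printed to stderr, which is more informative than the bare `1` from `os.system`.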

answered May 23 '14 at 05:06

Tim Pietzcker


And That didn't Work @Tim Pietzcker – Krishna May 23 '14 at 05:23
"Didn't work" is not really helpful. What exactly were the results when you used that command? I'm not a Unix person, but are there really supposed to be double slashes in paths? What happens if you type `pdftotext '//Home//Sai Krishna Dubagunta.pdf' '//Home//Me.txt'` in the directory that you're running the Python script in? – Tim Pietzcker May 23 '14 at 05:30
A double slash specifies a single slash in the input string, the same as in C, where we write // to print or specify /. The result is 1, which according to the error codes means Invalid Function. – Krishna May 23 '14 at 05:38
@Krishna: Are you sure you're not confusing slashes `"/"` and backslashes `"\"`? – Tim Pietzcker May 23 '14 at 05:39
Confused. Always had a problem with that. – Krishna May 23 '14 at 05:42
Best to use a raw string `r''` so you don't need to escape backslashes: `r'/Home/Me'`. @TimPietzcker et al.: since 1995, Windows has accepted `/` as equivalent to `\` – smci Apr 04 '19 at 22:09
Anyway it's better to use the Python wrapper to `pdftotext` as @MartinThoma shows. – smci Apr 04 '19 at 22:12
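
The slash/backslash confusion discussed in these comments can be checked directly in Python. A small sketch (the paths are only examples): forward slashes never need escaping, so doubling them really puts two slashes in the string, while a doubled backslash in a normal string is a single backslash.

```python
from pathlib import PureWindowsPath

# Forward slashes are ordinary characters: all four end up in the string
forward_doubled = "//Home//Me.txt"
print(forward_doubled.count("/"))  # 4

# Backslashes are the escape character: "\\" is one backslash
backslash_escaped = "\\Home\\Me.txt"
# A raw string keeps backslashes as written, so no doubling is needed
backslash_raw = r"\Home\Me.txt"
print(backslash_escaped == backslash_raw)  # True

# Windows path handling treats "/" and "\" as the same separator
same_path = PureWindowsPath("/Home/Me.txt") == PureWindowsPath(r"\Home\Me.txt")
print(same_path)  # True
```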

score 0 · Answer 3 · answered May 23 '14 at 05:04

I think the pdftotext command takes only one argument. Try using:

os.system(("pdftotext %s") % input1)

and see what happens. Hope this helps.

answered May 23 '14 at 05:04

haraprasadj


Then Where does the output come? I have to give a output path right ? some place to store the file. And the same output. Sorry. – Krishna May 23 '14 at 05:23
I came across your question while searching for some info on pdf automation (testing). I based my remark on this: http://en.wikipedia.org/wiki/Pdftotext where it is mentioned: $ pdftotext file.pdf This usage produces a text file with the same name as the input file. Wildcards (*), for example $ pdftotext *pdf, for converting multiple files, cannot be used because pdftotext expects only one file name. I could have misunderstood the question. – haraprasadj May 23 '14 at 05:59
I missed a package that must be installed, according to a user in another forum. [link](http://bytes.com/topic/python/answers/500078-convert-pdf-files-txt-files) But I couldn't try it as I don't know how to install that package. I'll try it using PyCharm – Krishna May 23 '14 at 06:16
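
On the single-argument form quoted from Wikipedia above: the output path is optional, and when it is omitted pdftotext writes next to the input, with `.pdf` replaced by `.txt`. A rough sketch of where to look for that file follows; `default_output_name` is a hypothetical helper, and it only approximates the tool's naming rule for inputs that end in `.pdf`.

```python
from pathlib import PurePosixPath


def default_output_name(pdf_path):
    """Approximate the file name pdftotext picks when no output path is given."""
    # Swap the .pdf extension for .txt, keeping directory and base name
    return str(PurePosixPath(pdf_path).with_suffix(".txt"))


print(default_output_name("/Home/Sai Krishna Dubagunta.pdf"))
# /Home/Sai Krishna Dubagunta.txt
```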
