187

Can anyone recommend a library/API for extracting the text and images from a PDF? We need to be able to get at text that is contained in pre-known regions of the document, so the API will need to give us positional information of each element on the page.

We would like that data to be output in XML or JSON format. We're currently looking at PdfTextStream, which seems pretty good, but would like to hear other people's experiences and suggestions.

Are there alternatives (commercial or free) for extracting text from a PDF programmatically?

Jonathan
Budda007
  • Related question: [Extract Images and Words with coordinates and sizes from PDF](http://stackoverflow.com/questions/8241724/extract-images-and-words-with-coordinates-and-sizes-from-pdf) – yms May 02 '13 at 16:59
  • For those needing something really simple (no position info), this perl regex may suffice: `/^\s*\[?\((.*?)\)\]?\s*T[Jj]/mg`. It just looks for the Tj/TJ operator, which denotes all normal text in a PDF. – Alex R Oct 24 '15 at 22:37
  • Use the [TomRoush PdfBox](https://github.com/TomRoush/PdfBox-Android) library; it works well on Android. – FaisalAhmed Mar 17 '17 at 11:42
  • Library recommendations are off topic for Stack Overflow. Such questions could be on topic on https://softwarerecs.stackexchange.com/ Before asking there, please read their help center and guidance on asking. – Dalija Prasnikar Oct 26 '22 at 11:02
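The quick regex approach from the comments can be sketched in Python. To be clear about its limits: it only works on *uncompressed* content streams (most real-world PDFs compress theirs with Flate), it ignores positioning entirely, and the sample content stream below is hand-written for illustration:

```python
import re

# Same idea as the perl one-liner in the comments above: grab the operand
# of the Tj/TJ text-showing operators from an uncompressed content stream.
TEXT_OP = re.compile(r'^\s*\[?\((.*?)\)\]?\s*T[Jj]', re.M)

def naive_extract(stream: str) -> list[str]:
    """Return the raw string operands of Tj/TJ operators, in order."""
    return TEXT_OP.findall(stream)

# Tiny hand-written content-stream fragment for illustration.
sample = """BT
/F1 12 Tf
72 712 Td
(Hello) Tj
[(World)] TJ
ET"""

print(naive_extract(sample))  # ['Hello', 'World']
```

This is a triage tool at best; for anything with compressed streams, encodings, or positional requirements, use one of the real libraries recommended in the answers below.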

15 Answers

148

I was given a 400 page pdf file with a table of data that I had to import - luckily no images. Ghostscript worked for me:

gswin64c -sDEVICE=txtwrite -o output.txt input.pdf

The output file was split into pages with headers, etc., but it was then easy to write an app to strip out blank lines and suck in all 30,000 records. -dSIMPLE and -dCOMPLEX made no difference in this case.
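The cleanup app described above can be sketched in a few lines. The header text and record format below are made up for illustration; the only real assumption is that `txtwrite` output interleaves data lines with blank lines and a repeating per-page header:

```python
def strip_page_furniture(text: str, header_prefix: str) -> list[str]:
    """Drop blank lines and repeated page headers from `gs -sDEVICE=txtwrite`
    output, returning the remaining data lines. The header prefix is whatever
    repeats at the top of each page in your particular file."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(header_prefix):
            continue  # skip blanks and page headers
        records.append(line)
    return records

# Made-up example resembling paginated txtwrite output.
page_dump = """ACME Report - Page 1

1001,Widget,4.99
1002,Gadget,7.50

ACME Report - Page 2

1003,Sprocket,2.25
"""

print(strip_page_furniture(page_dump, "ACME Report"))
# ['1001,Widget,4.99', '1002,Gadget,7.50', '1003,Sprocket,2.25']
```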

Umber Ferrule
user2176753
  • On linux and cygwin the command is `gs` instead of `gswin64c`. Works perfectly. No patented paid crap. It just works. – Jannes Jun 15 '15 at 09:48
  • Yup, works great! Now I can use "grep" with impunity on my pdf files. Since I can grep better than I can read, it's a win! (:-) Upvote. – David Elson Aug 08 '15 at 22:54
  • The only problem I had with this was using it on pdfs with embedded 'old' fonts. Works perfectly for locally generated pdfs, but harder with obscure sources. Otherwise, an excellent scriptlet. – Jon M Dec 23 '17 at 12:38
  • what does `-sDEVICE=txtwrite` do? I don't understand much after reading [How to Use Ghostscript | Selecting an output device](https://www.ghostscript.com/doc/9.52/Use.htm#Output_device "How to Use Ghostscript") – Ooker Apr 06 '20 at 13:58
  • For stdout output instead of saving as a text file, use `gswin64c -sDEVICE=txtwrite -o- input.pdf`. Source (slightly changed by me): https://gist.github.com/drmohundro/560d72ed06baaf16f191ee8be34526ac – LuH Jun 15 '20 at 09:21
  • Thanks so much! I've been struggling to read a [PDF](https://transition.fcc.gov/Bureaus/MB/Databases/cdbs/_Engineering_Data_Description.pdf) with a table in it for almost three days and this plus a simple CSharp script solved it for me. Did a better job than Word, Adobe Acrobat DC, Tabula, or any other tools I've tried. Real life saver. – RomanPort Oct 03 '21 at 19:16
43

An efficient open-source command-line tool, free of charge and available on both Linux and Windows: simply named pdftotext. This tool is part of the Xpdf suite.

http://en.wikipedia.org/wiki/Pdftotext

131
  • On a sidenote: use the `-layout` switch to preserve tables, works pretty well. – sebastian Feb 08 '16 at 08:57
  • Yes, PDFToText works surprisingly well. Nothing’s perfect, but this is the best of the bunch I tried. I like that it has several different algorithms you can pick from. Some algorithms work better with tables, others work better for multi-column text, some preserve spaces and some trim spaces, etc. It’s also surprisingly fast. I had a massive 1200-page PDF and it extracted the text in a matter of seconds, about 5-10x faster than Ghostscript. – Simon East Nov 16 '21 at 09:37
  • Official website is https://www.xpdfreader.com – Simon East Nov 16 '21 at 09:40
  • Great! Much better results than with `gs`. `pdftotext` merges hyphenated words at line ends (except when the `-layout` option is active) and keeps some inter-word spaces which, strangely, `gs` merges. – loved.by.Jesus Feb 07 '23 at 14:26
  • 2023 note: This is indeed the optimal solution. If you're on macOS you're looking for `brew install poppler` to get `pdftotext` and other utils. – snarik Feb 16 '23 at 22:58
31

As of today I know it: the best thing for text extraction from PDFs is TET, the Text Extraction Toolkit. TET is part of the PDFlib.com family of products.

PDFlib.com is Thomas Merz's company. In case you don't recognize his name: Thomas Merz is the author of the "PostScript and PDF Bible".

TET's first incarnation is a library. That one can probably do everything Budda007 wanted, including positional information about every element on the page. Oh, and it can also extract images, recombining images that are fragmented into pieces.

pdflib.com also offers another incarnation of this technology, the TET plugin for Acrobat. And the third incarnation is the PDFlib TET iFilter. This is a standalone tool for user desktops. Both these are free (as in beer) to use for private, non-commercial purposes.

And it's really powerful. Way better than Adobe's own text extraction. It extracted text for me where other tools (including Adobe's) spat out only garbage.

I just tested the desktop standalone tool, and what they say on their webpage is true. It has a very good command line. The tool handled some of my "problematic" PDF test files to my full satisfaction.

From now on this will be my recommendation for any sophisticated and challenging PDF text extraction requirement.

TET is simply awesome. It detects tables. Inside tables, it identifies cells spanning multiple columns. It identifies table rows and contents of each table cell separately. It deals very well with hyphenations: it removes hyphens and restores complete words. It supports non-ASCII languages (including CJK, Arabic and Hebrew). When encountering ligatures, it restores the original characters...

Give it a try.

Kurt Pfeifle
  • There is no trial version, and $440 is a bit much to "Give it a try." – Rok Strniša Sep 13 '13 at 12:41
  • @Darthenius: You must have missed this sentence: "[PDFlib TET can be evaluated without a license, but will only process PDF documents with up to 10 pages and 1 MB size unless a valid license key is applied](http://www.pdflib.com/download/tet/)". – Kurt Pfeifle Sep 13 '13 at 13:39
  • I tested it; it doesn't recognize columns. I scanned an English tabloid front page. The text was split into 3 columns on the paper, but this plugin mixed the sentences together, making the result look like gibberish. Ghostscript, which is free, had exactly the same output. – NoWhereToBeSeen Aug 04 '17 at 12:17
  • @RedHotScalability: BTW, you may have more luck with [this answer](https://stackoverflow.com/a/6189489/359307), **`pdftotext`** section. But I insist you add the **`-layout`** param... – Kurt Pfeifle Aug 04 '17 at 12:23
  • @RedHotScalability: Also BTW, TET *does* recognize columns if used with the correct parameters. But I leave it as an exercise to the ambitious JS scripter to read the documentation and find out how... – Kurt Pfeifle Aug 04 '17 at 12:26
  • Thanks @Kurt. My current use case is being able to recognise text regions, like acknowledgements, references etc. Do you have any advice about how to go about that? – lucid_dreamer Aug 25 '17 at 00:47
  • Just compared the results from TET, Xpdf pdftotext and Ghostscript. The PDF file had Latin and Cyrillic script, and a multi-column layout. Xpdf pdftotext was the best, then Ghostscript, and the worst was TET. – zoran Mar 13 '19 at 14:31
  • @Kurt Pfeifle xpdf-tools-win-4.01, Ghostscript 9.26, TET 5.1. Ended up using Apache Tika 1.20 – zoran Mar 15 '19 at 22:29
22

For Python, there are PDFMiner and PyPDF2. For more information on these, see Python module for converting PDF to text.

Community
Jonathan
13

Here is my suggestion. If you want to extract text from a PDF, you could import the PDF file into Google Docs, then export it to a friendlier format such as .html, .odf, .rtf, or .txt. All of this can be done using the Drive API. It is free* and robust. Take a look at:

https://developers.google.com/drive/v2/reference/files/insert https://developers.google.com/drive/v2/reference/files/get

Because it is a REST API, it is compatible with all programming languages. The links I posted above have working examples for many languages, including Java, .NET, Python, PHP, Ruby, and others.

I hope it helps.

dev4life
  • I've used that option and I wouldn't recommend it. Google's pdf text extraction isn't as good as many alternatives (esp. for non-English) and it is also very very sloooow. – Björn Lindqvist May 19 '14 at 09:53
  • I just tested this in the standard Google Docs UI, and I was actually surprised at how well this did. It correctly parsed a document with multiple text columns, and was the only tool I tried that removed line returns where it thought the text was the continuation of a single paragraph, but kept line-returns in other places. It didn't get this perfectly right, and needed some manual refinement, but it appears to be better than most other tools that just force line returns at the end of every line in a PDF. – Simon East Oct 26 '21 at 11:26
  • I have deep privacy concerns about Google: what exactly they do with the data/files you upload to their drive. As a general rule, I always prefer offline methods: `gs` or `pdftotext`. It seems to me a waste of energy and resources to upload your file to Google, use their servers, and then download the result. My opinion. – loved.by.Jesus Feb 07 '23 at 13:46
11

PdfTextStream (which you said you have been looking at) is now free for single threaded applications. In my opinion its quality is much better than other libraries (esp. for things like funky embedded fonts, etc).

It is available in Java and C#.

Alternatively, you should have a look at Apache PDFBox, open source.

Renaud
7

Another answer here used gs on Windows. I had some success with that on Linux/OSX too, with the following syntax:

gs \
 -q \
 -dNODISPLAY \
 -dSAFER \
 -dDELAYBIND \
 -dWRITESYSTEMDICT \
 -dSIMPLE \
 -f ps2ascii.ps \
 "${input}" \
 -dQUIET \
 -c quit

I used `-dSIMPLE` instead of `-dCOMPLEX` because the latter outputs 1 character per line.

kvz
6

Docotic.Pdf library may be used to extract text from PDF files as plain text or as a collection of text chunks with coordinates for each chunk.

Docotic.Pdf can be used to extract images from PDFs, too.

Disclaimer: I work for Bit Miracle.

Bobrovsky
5

As the question specifically asks about alternative tools to get data from PDF as XML, you may be interested in the commercial tool "ByteScout PDF Extractor SDK", which is capable of doing exactly this: extracting text from PDF as XML along with the positioning data (x, y) and font information:

Text in the source PDF:

Products | Units | Price 

Output XML:

<row>
  <column>
    <text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="212" y="126" width="47" height="11">Products</text>
  </column>
  <column>
    <text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="428" y="126" width="27" height="11">Units</text>
  </column>
  <column>
    <text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="503" y="126" width="26" height="11">Price</text>
  </column>
</row>

P.S.: it also breaks the text into a table-based structure.
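Since the question asks for XML or JSON, it is worth noting that positional XML like the sample above can be converted to JSON with nothing but the Python standard library. The snippet below is a sketch that parses the exact sample row shown (inlined as a string); it does not depend on any ByteScout API:

```python
import json
import xml.etree.ElementTree as ET

# The sample <row> from the answer above, inlined verbatim for illustration.
xml_sample = """<row>
 <column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="212" y="126" width="47" height="11">Products</text></column>
 <column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="428" y="126" width="27" height="11">Units</text></column>
 <column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="503" y="126" width="26" height="11">Price</text></column>
</row>"""

def row_to_json(xml_text: str) -> str:
    """Flatten each <text> element into a dict of its content and position."""
    row = ET.fromstring(xml_text)
    cells = [
        {"text": t.text,
         "x": int(t.get("x")), "y": int(t.get("y")),
         "width": int(t.get("width")), "height": int(t.get("height"))}
        for t in row.iter("text")
    ]
    return json.dumps(cells)

print(row_to_json(xml_sample))
```

The same pattern applies to the positional output of any of the tools in this thread, as long as they emit well-formed XML.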

Disclosure: I work for ByteScout

Eugene
3

I know that this topic is quite old, but this need is still alive. I read many documents, forums and scripts, and built a new, advanced one which supports both compressed and uncompressed PDFs:

https://gist.github.com/smalot/6183152

In some cases command-line execution is forbidden for security reasons, so a native PHP class can fit many needs.

Hope it helps everyone.

Sebastien Malot
3

The best thing I can currently think of (within the list of "simple" tools) is Ghostscript (current version is v8.71) together with the PostScript utility program ps2ascii.ps, which Ghostscript ships in its lib subdirectory. Try this (on Windows):

gswin32c.exe ^
   -q ^
   -sFONTPATH=c:/windows/fonts ^
   -dNODISPLAY ^
   -dSAFER ^
   -dDELAYBIND ^
   -dWRITESYSTEMDICT ^
   -dCOMPLEX ^
   -f ps2ascii.ps ^
   -dFirstPage=3 ^
   -dLastPage=7 ^
   input.pdf ^
   -dQUIET ^
   -c quit

This command processes pages 3-7 of input.pdf. Read the comments in the ps2ascii.ps file itself to see what the "weird" numbers and additional infos mean (they indicate strings, positions, widths, colors, pictures, rectangles, fonts and page breaks...). To get a "simple" text output, replace the -dCOMPLEX part by -dSIMPLE.

Kurt Pfeifle
  • As you would guess, this only outputs ASCII text. While free, it's not a great option for software that you plan to use with languages other than English. – userx Sep 08 '10 at 21:57
  • @userx: As you could guess, this is Free software: therefore source code available. Possible to extend for support of non-ASCII... – Kurt Pfeifle Sep 09 '10 at 23:21
  • @userx: today I discovered 'TET', the Text Extraction Toolkit from pdflib.com. See my other answer. – Kurt Pfeifle Sep 15 '10 at 23:27
  • **ps2ascii** from Ghostscript 9.07 worked beautifully on my OpenBSD system. I just converted a 526-page PDF to plain text. Now I can easily grep and extract text for notes. I used the simple command `ps2ascii book.pdf notes.txt`. If your document is predominately ASCII, you're in luck. – Clint Pachl Apr 18 '20 at 08:48
2

For image extraction, pdfimages is a free command line tool for Linux or Windows (win32):

pdfimages: Extract and Save Images From A Portable Document Format (PDF) File

Sun
2

Apache PDFBox has this feature; the text part is described in:

http://pdfbox.apache.org/apidocs/org/apache/pdfbox/util/PDFTextStripper.html

For an example implementation, see https://github.com/WolfgangFahl/pdfindexer

The test case `TestPdfIndexer.testExtracting` shows how it works.

Wolfgang Fahl
1

On my Macintosh systems, I find that "Adobe Reader" does a reasonably good job. I created an alias on my Desktop that points to the "Adobe Reader.app", and all I do is drop a pdf-file on the alias, which makes it the active document in Adobe Reader, and then from the File-menu, I choose "Save as Text...", give it a name and where to save it, click "Save", and I'm done.

Dick Guertin
  • The OP looked for a solution for *extracting text from a pdf programmatically*. Your answer proposes a manual routine instead. – mkl Jan 12 '15 at 07:29
1

QuickPDF seems to be a capable library that should do what you want, at a reasonable price.

http://www.quickpdflibrary.com/ - They have a 30-day trial.

Andrew Cash