How to search contents of multiple pdf files?

Question

How could I search the contents of PDF files in a directory/subdirectory? I am looking for some command line tools. It seems that grep can't search PDF files.

Grep will not work as PDF is a binary format and the text is often compressed or encoded in a variety of ways. — mark stephens, Jan 10 '11 at 07:37
Here is a GUI solution: Adobe Reader, see https://wikispaces.psu.edu/display/training/Search+for+Text+in+Multiple+PDFs+with+Adobe+Reader — Martin Thoma, Aug 01 '12 at 13:44
Related: http://unix.stackexchange.com/questions/6704/grep-pdf-files — Flow, Jun 22 '13 at 12:59
Adobe reader works fine, but it does not index; so if you have a lot of files, it will be slow. Any indexing solution? — Irina Rapoport, Jan 29 '14 at 16:40

score 270 · Answer 1 · edited Oct 31 '19 at 09:30

270

There is pdfgrep, which does exactly what its name suggests.

pdfgrep -R 'a pattern to search recursively from path' /some/path

I've used it for simple searches and it worked fine.

(There are packages in Debian, Ubuntu and Fedora.)

Since version 1.3.0 pdfgrep supports recursive search. This version is available in Ubuntu since Ubuntu 12.10 (Quantal).

edited Oct 31 '19 at 09:30

Flow

23,572
15
99
156

answered Mar 25 '11 at 15:42

Graeme

2,709
1
14
2

1

From Natty (Ubuntu 11.04) upwards (See http://packages.ubuntu.com/search?keywords=pdfgrep&searchon=names&suite=all&section=all) – Martin Thoma Aug 01 '12 at 13:34
3

@pavon `pdfgrep` does now have that recursion option, including `-R` to also follow symlinks – Tobias Kienzler Sep 29 '14 at 11:53
1

I have a problem with this tool on Debian 10. It does not find some strings which can be found with evince. Turns out to be quite unreliable. – Ohumeronen Feb 14 '21 at 16:11
1

@Ohumeronen Seven years later, problem remains. Results seem to depend on how pdf was created. So pdftotext -raw (though deprecated) seems to help. – yasd Apr 06 '21 at 17:14
As of [release 2.0](https://pdfgrep.org/news/pdfgrep-2.0-released.html) `pdfgrep` has a `--cache` option to drastically speed up multiple searches on the same files. – Stefan Schmidt Oct 24 '22 at 00:51

score 242 · Accepted Answer · edited Apr 22 '16 at 13:51

242

Your distribution should provide a utility called pdftotext:

find /path -name '*.pdf' -exec sh -c 'pdftotext "{}" - | grep --with-filename --label="{}" --color "your pattern"' \;

The "-" is necessary to have pdftotext output to stdout, not to files. The --with-filename and --label= options will put the file name in the output of grep. The optional --color flag is nice and tells grep to output using colors on the terminal.

(In Ubuntu, pdftotext is provided by the package xpdf-utils or poppler-utils.)

This method, using pdftotext and grep, has an advantage over pdfgrep if you want to use features of GNU grep that pdfgrep doesn't support. Note: pdfgrep-1.3.x supports -C option for printing line of context.

edited Apr 22 '16 at 13:51

m-ric

5,621
7
38
51

answered Jan 10 '11 at 03:43

sjr

9,769
1
25
36

1

@Kurt Pfeifle The edit "(Edit by -kp-)" you made does no not work since `grep` filters the printed file names. – Raphael Ahrens Aug 13 '13 at 09:07
@sjr no, while the `pdfgrep` solution is good for really quick and simple searches, often I want to get some context, as a single line won't be helpful enough -- so as I added to this answer: For instance, you can add the -C5 option before "your pattern" to include 5 lines of context to the output -- pdfgrep does not support this – Colin D Bennett Oct 14 '13 at 18:58
oh that's cool, glad to know there are advantages to this even though it is much less obvious to most people wtf it is doing – sjr Oct 16 '13 at 04:30
2

@sjr Just for the record: I am using Ubuntu 12.10 and `pdfgrep` is useless, it reports a tremendous amount of rubbish on files it cannot handle. Your solution on the other hand helped. So please don't delete it, even after 3 years it is still helpful! – Ali Jun 12 '14 at 21:13
I was able to use it also in cygwin, altough to make it a function with parameter I had to make the "your_pattern" become '$1' – Koshmaar Apr 17 '15 at 12:28

Glutanimate · Answer 3 · 2014-01-25T17:08:06.507

35

Recoll is a fantastic full-text GUI search application for Unix/Linux that supports dozens of different formats, including PDF. It can even pass the exact page number and search term of a query to the document viewer and thus allows you to jump to the result right from its GUI.

Recoll also comes with a viable command-line interface and a web-browser interface.

edited Jan 25 '14 at 17:08

answered May 29 '13 at 11:59

Glutanimate

1,692
20
21

2

@Glutanimate It would help (me and possibly others too) if you could add an **example** pertaining to the original question *(command line tool for search of multiple pdf's):* I would also like to see how to perform a **wildcard search** and how to search the **current directory including all subdirectories**. How would that look with `recoll / xapian` in the command line (non-GUI)? Thanks! – nutty about natty Aug 31 '15 at 09:41
@LeszekŻarna Perhaps you could post the example you tested? – nutty about natty Aug 31 '15 at 09:42
The `recoll` [user manual](http://www.lesbonscomptes.com/recoll/usermanual/usermanual.html) might contain some pointers, but offers a rather technical and "off-topic" read... – nutty about natty Aug 31 '15 at 09:48
1

@nutty: recoll -t -q dir:`pwd` ext:pdf 'neuro*' -- stackoverflow ate the backticks around pwd. – medoc Feb 25 '16 at 18:12

score 17 · Answer 4 · answered May 22 '14 at 04:40

17

My actual version of pdfgrep (1.3.0) allows the following:

pdfgrep -HiR 'pattern' /path

When doing pdfgrep --help:

H: Print the file name for each match.
i: Ignore case distinctions.
R: Search directories recursively.

It works well on my Ubuntu.

answered May 22 '14 at 04:40

arkhi

488
3
14

score 15 · Answer 5 · answered Jul 29 '19 at 09:06

There is another utility called ripgrep-all, which is based on ripgrep.

It can handle more than just PDF documents, like Office documents and movies, and the author claims it is faster than pdfgrep.

Command syntax for recursively searching the current directory, and the second one limits to PDF files only:

rga 'pattern' .
rga --type pdf 'pattern' .

score 7 · Answer 6 · edited Sep 30 '13 at 15:01

7

I made this destructive small script. Have fun with it.

function pdfsearch()
{
    find . -iname '*.pdf' | while read filename
    do
        #echo -e "\033[34;1m// === PDF Document:\033[33;1m $filename\033[0m"
        pdftotext -q -enc ASCII7 "$filename" "$filename."; grep -s -H --color=always -i $1 "$filename."
        # remove it!  rm -f "$filename."
    done
}

edited Sep 30 '13 at 15:01

Ivo

3,481
1
25
29

answered Jun 10 '11 at 15:48

phil

71
1
1

3

+1. But instead of the `$filename.` you should pipe it into `grep`. – Raphael Ahrens Aug 13 '13 at 09:06

score 4 · Answer 7 · answered Sep 26 '14 at 18:13

4

I like @sjr's answer however I prefer xargs vs -exec. I find xargs more versatile. For example with -P we can take advantage of multiple CPUs when it makes sense to do so.

find . -name '*.pdf' | xargs -P 5 -I % pdftotext % - | grep --with-filename --label="{}" --color "pattern"

answered Sep 26 '14 at 18:13

Deian

1,237
15
31

1

interesting point about `xargs`' parallel-processing capability. Note that your `--label` option-argument will be _literally_ `{}`, because the `grep` command is now no longer executed in the context of `find`'s `exec`. – mklement0 Jan 24 '17 at 12:56

score 2 · Answer 8 · answered Jun 24 '12 at 14:04

2

I had the same problem and thus I wrote a script which searches all pdf files in the specified folder for a string and prints the PDF files wich matched the query string.

Maybe this will be helpful to you.

You can download it here

answered Jun 24 '12 at 14:04

Paul Weibert

194
1
4

maybe useful to put the script in the comment? – baxx Dec 22 '15 at 20:21
i tried your script and it turns out much slower than the `pdfgrep` solution or sjr's one-liner, and it left me with an ongoing process using 100% of a CPU thread even after I Ctrl-C to terminate it. – Jason Jul 21 '18 at 06:49

score 2 · Answer 9 · answered Jan 24 '13 at 17:17

2

If You want to see file names with pdftotext use following command:

find . -name '*.pdf' -exec echo {} \; -exec pdftotext {} - \; | grep "pattern\|pdf"

answered Jan 24 '13 at 17:17

Aleksey Kontsevich

4,671
4
46
101

score 2 · Answer 10 · answered Jan 02 '16 at 22:07

2

First convert all your pdf files to text files:

for file in *.pdf;do pdftotext "$file"; done

Then use grep as normal. This is especially good as it is fast when you have multiple queries and a lot of PDF files.

answered Jan 02 '16 at 22:07

Martin Thoma

124,992
159
614
958

This, when done in combination with `ag` https://github.com/ggreer/the_silver_searcher . Capable to parse at psychedeliks Gb by microseconds. Flat files for life – NVRM Mar 10 '18 at 17:13

Craig · Answer 11 · 2014-04-13T22:28:38.487

1

There is an open source common resource grep tool crgrep which searches within PDF files but also other resources like content nested in archives, database tables, image meta-data, POM file dependencies and web resources - and combinations of these including recursive search.

The full description under the Files tab pretty much covers what the tool supports.

I developed crgrep as an opensource tool.

edited Apr 13 '14 at 22:28

answered Oct 23 '13 at 12:04

Craig

1,383
12
9

Craig - do you have a connection to that project? If so, you should state it in your answer. I say this because you've just posted a virtually identical answer to two other old questions ... – Stephen C Nov 10 '13 at 07:19
Updated post to clarify that I'm the author of crgrep – Craig Apr 13 '14 at 22:29

score 0 · Answer 12 · answered Jan 10 '11 at 03:43

You need some tools like pdf2text to first convert your pdf to a text file and then search inside the text. (You will probably miss some information or symbols).

If you are using a programming language there are probably pdf libraries written for this purpose. e.g. http://search.cpan.org/dist/CAM-PDF/ for Perl

score 0 · Answer 13 · answered Jan 10 '11 at 09:09

0

try using 'acroread' in a simple script like the one above

answered Jan 10 '11 at 09:09

acathur

11

luttztfz · Answer 14 · 2022-02-06T16:49:56.123

Thanks for all the good ideas here!

I tried the xargs method, but as pointed out here, xargs will make it impossible (or very hard) to include printing the actual file name...

So I tried the whole thing with GNU parallel.

parallel "pdftotext -q {} - | grep --with-filename --label='['{}']' --color=always --context=5 'pattern'" ::: *.pdf

This prints not only the pattern, but with --context=5 also 5 lines above and below as well for context.
With -q pdftotext won't print any error messages or warnings (quiet).
I use brackets [] as labels instead of braces {}. If you wanted braces --label='{'{}'}' will make that happen. Note that {} is replaced by the actual filename by GNU parallel, e.g. 'Example portable document file name with spaces.pdf' ({} is already using single quotes ').
By using --label={} only the filename will be printed, which may be the favored way of displaying the filename.
I also noticed that the output was without color when I tried it, except when forcing it by adding --color=always with grep.
It may be useful to add --ignore-case to the grep command for a case-insensitive keyword search.

If all PDF files should be processed recursively, including all sub-directories in the current directory (.), this can be accomplished through find:

find . -type f -iname '*.pdf' -print0 | parallel -0 "pdftotext -q {} - | grep --with-filename --label='['{}']' --color=always --context=5 'pattern'"

With find, -iname '*.pdf' acts case-insensitive. With -name '*.pdf' only lower-case .pdf files will be included (the normal case). Since I sometimes also encountered Windows PDF-files with an upper-case .PDF file extension, I tend to prefer -iname...
The above command also works with the -print find option (instead of -print0), so it will be line-based (one file name per line), then -0 (NUL delimiter) must be omitted from the parallel command.
Again, including --ignore-case in the grep command will make the search case-insensitive.

As a general recommendation when playing with the whole command line, parallel --dry-run will print which commands would be executed.

$ find . -type f -iname '*.pdf' -print0 | parallel --dry-run -0 "pdftotext -q {} - | grep --with-filename --label='['{}']' --color=always --ignore-case --context=5 'pattern'"
pdftotext -q ./test PDF file 1.pdf - | grep --with-filename --label='['./test PDF file 1.pdf']' --color=always --ignore-case --context=5 'pattern'
pdftotext -q ./subdir1/test PDF file 2.pdf - | grep --with-filename --label='['./subdir1/test PDF file 2.pdf']' --color=always --ignore-case --context=5 'pattern'
pdftotext -q ./subdir2/test PDF file 3.pdf - | grep --with-filename --label='['./subdir2/test PDF file 3.pdf']' --color=always --ignore-case --context=5 'pattern'

score -1 · Answer 15 · edited Jun 28 '22 at 14:45

Use pdfgrep:

pdfgrep -HinR 'FWCOSP' DatenModel/

In this command I'm searching for the word FWCOSP inside the folder DatenModel/.

As you can see in the output you can have the file name wit the line numbers:

The options I'm using are:

-i : Ignores, case for matching
-H : print the file name for each match
-n : prefix each match with the number of the page where it is found
-R : same as -r, but it also follows all symlinks.

How to search contents of multiple pdf files?

15 Answers15

Linked