How to couple xargs with the pdftotext converter to search inside multiple PDF files

Question

I am writing a script that is supposed to search inside all the PDF files in a directory. I found a converter named "pdftotext" that lets me use grep on PDF files, but I can only get it to work with a single file. When I try to run it over all the files in a directory, it fails. Any suggestions?

This works for a single file:

pdftotext my_file.pdf - | grep 'hot'

This fails when finding the PDF files, converting them to text, and grepping:

SHELL PROMPT>find ~/.personal/tips -type f -iname "*" | grep -i "*.pdf" | xargs pdftotext |grep admin
pdftotext version 3.00
Copyright 1996-2004 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>          : first page to convert
  -l <int>          : last page to convert
  -layout           : maintain original physical layout
  -raw              : keep strings in content stream order
  -htmlmeta         : generate a simple HTML file, including the meta information
  -enc <string>     : output text encoding name
  -eol <string>     : output end-of-line convention (unix, dos, or mac)
  -nopgbrk          : don't insert page breaks between pages
  -opw <string>     : owner password (for encrypted files)
  -upw <string>     : user password (for encrypted files)
  -q                : don't print any messages or errors
  -cfg <string>     : configuration file to use in place of .xpdfrc
  -v                : print copyright and version info
  -h                : print usage information
  -help             : print usage information
  --help            : print usage information
  -?                : print usage information
SHELL PROMPT 139>

Charles Duffy · Accepted Answer · 2015-03-24T12:13:24.423

xargs is the wrong tool for this job: find does everything you need built-in.

find ~/.personal/tips \
    -type f \
    -iname "*.pdf" \
    -exec pdftotext '{}' - ';' \
  | grep hot
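
For reference, the pipeline in the question most likely fails before pdftotext ever sees a file name: grep treats "*.pdf" as a regular expression rather than a glob, and a leading * in a basic regular expression is a literal asterisk, so no ordinary path gets through. xargs then has nothing to pass along (GNU xargs still runs the command once on empty input, which would explain the usage message). A small sketch with made-up file names, using printf in place of find:

```shell
# Made-up names standing in for find's output.
names=$(printf '%s\n' notes.pdf report.PDF readme.txt)

# As a regex, '*.pdf' means: a literal '*', any one character, then 'pdf'.
# None of the names contains '*', so nothing matches (|| true: grep exits 1).
glob_style=$(printf '%s\n' "$names" | grep -ci '*.pdf' || true)

# An anchored regex that really means "ends in .pdf" (case-insensitively).
regex_style=$(printf '%s\n' "$names" | grep -ci '\.pdf$' || true)

echo "glob-style pattern matched $glob_style name(s), regex matched $regex_style"
```

Letting find do the filtering with -iname "*.pdf", as above, avoids the problem entirely.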

That said, if you did want to use xargs for some reason, correct usage would look something like...

find ~/.personal/tips \
    -type f \
    -iname "*.pdf" \
    -print0 \
  | xargs -0 -J % -n 1 pdftotext % - \
  | grep hot

Note that:

The find command uses -print0 to NUL-delimit its output
The xargs command uses -0 to NUL-delimit its input (which also turns off some behavior which would lead to incorrect handling of filenames with whitespace in their names, literal quote characters, etc).
The xargs command uses -n 1 to call pdftotext once per file
The xargs command uses -J % to specify a sigil for where the replacement should happen, and uses that % in the pdftotext command line appropriately. (-J is a BSD xargs option; with GNU xargs, the POSIX -I % plays the same role.)
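
One limitation of piping everything into a single grep is that the output no longer says which PDF a match came from. A minimal sketch of one way around that, assuming you only need the names of the matching files: the function name list_matching_pdfs and the PDF2TEXT override variable are hypothetical and introduced here only for illustration.

```shell
# Hypothetical helper: print the path of every PDF under DIR whose
# extracted text matches PATTERN.
# Usage: list_matching_pdfs PATTERN DIR
# PDF2TEXT is an assumed override hook; it defaults to pdftotext.
list_matching_pdfs() {
    find "$2" -type f -iname '*.pdf' -exec sh -c '
        converter=$1
        pattern=$2
        shift 2
        for f in "$@"; do
            # Convert one file to stdout; print its name if grep finds a match.
            if "$converter" "$f" - 2>/dev/null | grep -q -e "$pattern"; then
                printf "%s\n" "$f"
            fi
        done
    ' sh "${PDF2TEXT:-pdftotext}" "$1" {} +
}
```

Calling list_matching_pdfs hot ~/.personal/tips would then print one path per matching file. The inner sh -c loop still runs the converter once per file, like the -exec ... ';' form above.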

Petr Skocik · Answer 2 · 2015-03-24T12:33:47.303


find . -name '*.pdf' -print0 | xargs -0 -n1 -I '{}' pdftotext '{}' -

By default, xargs will try to fit as many arguments as possible onto the command line for pdftotext. You don't want that. What you want is one file per invocation, followed by '-'. You can achieve this with -n1 (limit to one argument per invocation) and -I '{}' (make {} the placeholder for where the argument will go).
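
The batching behaviour is easy to see without any PDFs. In this sketch a tiny sh -c command stands in for pdftotext and just reports how many arguments each invocation received; the file names are made up.

```shell
# Default: xargs packs all three names into a single invocation.
batched=$(printf '%s\0' a.pdf b.pdf c.pdf |
    xargs -0 sh -c 'echo "$# argument(s)"' sh)

# With -n 1: one invocation per name, which is what pdftotext needs.
per_file=$(printf '%s\0' a.pdf b.pdf c.pdf |
    xargs -0 -n 1 sh -c 'echo "$# argument(s)"' sh)

printf 'default:\n%s\nwith -n 1:\n%s\n' "$batched" "$per_file"
```

In the batched case pdftotext would be handed several names at once, which does not fit its "<PDF-file> [<text-file>]" usage.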

The -print0 option to find coupled with the -0 options to xargs makes both use '\0' (null bytes) instead of newlines ('\n') as argument separators.
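
A quick way to see why the NUL delimiters matter, sketched with a throwaway directory and a made-up file name containing a space (this assumes the temporary directory path itself contains no whitespace):

```shell
tmp=$(mktemp -d)
touch "$tmp/my file.pdf"

# Newline-delimited: xargs also splits on the space, so one file
# arrives as two arguments.
naive=$(find "$tmp" -name '*.pdf' | xargs sh -c 'echo "$#"' sh)

# NUL-delimited: the name arrives intact as a single argument.
safe=$(find "$tmp" -name '*.pdf' -print0 | xargs -0 sh -c 'echo "$#"' sh)

echo "default: $naive argument(s), NUL-delimited: $safe argument(s)"
rm -rf "$tmp"
```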

Xargs with -n1 and -I{} used like this is pretty much semantically equivalent to the find -exec recommended by Charles Duffy. Xargs has the advantage that it can make use of multicore processors (it can run multiple instances of pdftotext at a time; you can configure how many with the -P switch).
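
A rough sketch of the -P idea. Several converters writing to one shared stdout could interleave their output, so this version lets pdftotext use its default of writing file.txt next to file.pdf (the usage text lists <text-file> as optional) and greps the text files afterwards. The function name parallel_pdf_grep and the PDF2TEXT override variable are hypothetical, and -P is an extension found in GNU and BSD xargs rather than POSIX.

```shell
# Hypothetical helper: convert every PDF under DIR to a sibling .txt file,
# running up to JOBS converters at once, then list the .txt files that
# match PATTERN.
# Usage: parallel_pdf_grep PATTERN DIR [JOBS]
# PDF2TEXT is an assumed override hook; it defaults to pdftotext.
parallel_pdf_grep() {
    # Conversion step: one converter process per file, up to JOBS at a time.
    find "$2" -type f -iname '*.pdf' -print0 |
        xargs -0 -n 1 -P "${3:-4}" "${PDF2TEXT:-pdftotext}"
    # Search step: grep -l prints only the names of matching files.
    find "$2" -type f -name '*.txt' -exec grep -l -e "$1" {} +
}
```

Note that this leaves the .txt files behind and also searches any .txt files that were already in the tree, so it suits a scratch copy of the directory better than the original.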

edited Mar 24 '15 at 12:33

answered Mar 24 '15 at 12:08


I am getting the error below. Running `find ~/.personal/tips/pdf -name '*.pdf' -print0 | xargs -0 -n1 -I{} pdftotext {} -` gives `xargs: {}: No such file or directory` – Mar 24 '15 at 12:13

Might be worth quoting `{}` in case anyone reading this answer is using zsh. (This is why I stick to the `%` suggested in the xargs man page; I don't use zsh myself, but no reason to create a failure mode for other folks using a major shell). – Charles Duffy Mar 24 '15 at 12:14

BTW, while `find`'s output can represent all filenames except those with newlines accurately without using `-print0`, the default behavior used by `xargs` for reading content is not so robust without `-0`; it tries to interpret quotes, parse whitespace, and the like; it's not a straight equivalent with a newlines-vs-NULLs swap. Using the GNU xargs extension `-d $'\n'` is wise if using xargs to read newline-delimited filenames, as this disables the other behaviors. – Charles Duffy Mar 24 '15 at 12:21
@Charles Duffy Good to know. I've been sticking to `-print0` (find) + `-0` mostly (xargs), trying to steer away from those dark corner cases of UNIX command line processing. – Petr Skocik Mar 24 '15 at 12:31

K J · Answer 3 · 2022-05-22T00:28:32.610

This is a Linux question, so it is primarily about how to use the command line to search all PDF files for "hot" on Linux.

Windows users would need a slightly different syntax, using for or forfiles to recurse through the directories, for example something like:

forfiles /P "C:\Users\WDAGUtilityAccount\Desktop\SandBox" /S /M *.pdf /C "cmd /c pdftotext @file - | find /I \" hot \""

However, that would generate reams of output, with many PDF errors mixed in with the valid matches, such as:

Syntax Warning: Invalid Font Weight
Syntax Warning: Invalid Font Weight
identifies hot (frequently executed) bytecode sequences, records
their time in hot loops. Even in dynamically typed languages, we
....
.....

However, there is a much simpler method: after first ensuring that a PDF iFilter is installed, simply enter "hot" in a file search. Here, that finds 26 results across all the sandbox folders.

phili_b · Answer 4 · 2022-05-21T22:32:31.143

This answer concatenates all the codes found by a regex inside each PDF and renames each PDF file by prefixing its name with those codes.

Examples of codes matching the regexp used in the script, as found in the PDF files:

File1.pdf:X123456
File1.pdf:A1234567
File2.pdf:X003456
File2.pdf:A0034567

So File1 and File2 files will be renamed:

X123456_A1234567_File1.pdf
X003456_A0034567_File2.pdf

The script is named find_codes_in_pdf_and_rename.sh.

Make it executable with chmod +x find_codes_in_pdf_and_rename.sh

To run it with output to both the screen and a log file (the sed adds CR+LF line endings so that the log is readable under Windows):

./find_codes_in_pdf_and_rename.sh 2>&1 | tee | sed -u 's/$/\r/' 2>&1 | tee find_codes_in_pdf_and_rename.sh_$(date "+%Y_%m_%d_%Hh_%M_%S").log

#!/bin/bash -e


PrevFile=""
PrevCodes=""
mycmd1=""
mycmd2=""

DIRPrevFile="."
DIRFile="."

BASEFile=""

# look for files where the extension is pdf
# -print0 to have character zero to manage file name with space

find /my_path/ -iname "*.pdf" -print0 |  
# head for debug only two files, -z for print0 
# # head -z -n 2 |  
# sort, -z for print0 
sort -z| 
# exclude filenames that already contain a code, -z for print0
grep -z -v   -E   ".*[\s\.\/][A-Z][0-9]{6,7}.*" | 
# list filename:code
xargs -0 pdfgrep  -i  --only-matching  --with-filename -e "([A-Z]{1}[0-9]{6,7})"  2>&1 |
# exclude  "pdfgrep: Could not open"
tee| grep -v "pdfgrep: Could not open" |
# exclude empty lines
grep -v -e '^$' |
# find path of filename in regexp code group 1 
# and code in regexp code group 3 
# and keep only that in the list with the character ':' at the middle. 
# It's partially redundant if pdfgrep works well with --only-matching
sed --regexp-extended -e   's/(.):(.*)([A-Z][0-9]{6,7})(.*)/\1:\3/gm' |
uniq| {
   while read line
   do
       File=$( echo "$line" |cut -d\: -f1 )
       code=$( echo "$line" |cut -d\: -f2 )

       #echo File $File
       #echo code $code

       if [ "$PrevFile" == "" ]
       then
           PrevFile=$File
       fi

       if [ "$PrevFile" == "$File" ] && [ -n "$PrevCodes" ]
       then
           # concatenate all previous code to current code for the same filename 
           PrevCodes="${PrevCodes} ${code}"
       else
           PrevCodes=$code
       fi
       # uniques codes
       PrevCodes=$(echo  $PrevCodes | tr ' ' '\n' | sort | uniq | tr '\n' ' ')
  
       # echo $PrevCodes
       DIRPrevFile=`dirname "${PrevFile}"`
       DIRFile=`dirname "${File}"`
       #echo $DIRPrevFile   
   
       if [ "${DIRPrevFile}/${PrevFile}" != "${DIRFile}/${File}" ]
       then
           # computed at the previous loop of filename
           # echo "MVFake ${mycmd1}" "${mycmd2}"
           set -x
           mv "${mycmd1}" "${mycmd2}"
           set +x
        fi
   
        # to remove old PDF extension
        BASEFile=$(echo `basename  "${File}" .pdf` )
   
        # mycmd1: old filename
        mycmd1="$File"
   
        # concatenate all codes with the old filename, and replace . and space with _
        target=$(echo "${PrevCodes} ${BASEFile}" | sed "s/[ .]/_/g" ) 
        # mycmd2: new filename, kept in the current file's own directory
        mycmd2="${DIRFile}/${target}.pdf"

        PrevFile=$File
    done
    # echo "MVFake ${mycmd1}" "${mycmd2}"
    set -x
    mv "${mycmd1}" "${mycmd2}"
    set +x
}

Test this code against filenames containing literal backslashes. (That's not the only obvious bug, but it's one of the most consistent ones; the other glaring issues on a quick scan are mostly portability problems and so harder to reproduce). — Charles Duffy, May 21 '22 at 18:35
OK :) I know this isn't an argument, but it does work in practice at my job, across many directories and files, on Ubuntu 20 with bash version 5.0.17(1)-release (x86_64-pc-linux-gnu). Perhaps I introduced an error or a typo when translating my French comments into English. But as I read it, your main remark is that it's not very portable. – phili_b, May 21 '22 at 19:15
The immediate issue that came to mind was use of `read` without `-r`; not clearing IFS for the duration of the read also breaks filenames that begin or end with whitespace. As for places where this code is needlessly specific to bash -- [`==` inside `[`](https://stackoverflow.com/questions/25846454/bin-sh-odd-string-comparison-error-unexpected-operator), for one. [`echo`](https://unix.stackexchange.com/questions/65803/why-is-printf-better-than-echo), for another. — Charles Duffy, May 21 '22 at 20:19
(even when the shell _is_ known to be bash, `echo` behavior can be inconsistent based on the value of runtime flags like `xpg_echo`). — Charles Duffy, May 21 '22 at 20:19
BTW, I also strongly recommend using `$( )` instead of backticks -- both forms are POSIX-standardized, but backticks change behavior of backslashes and other backticks within them, whereas parsing rules are identical both inside and outside of `$( )`. See https://stackoverflow.com/a/33301370/14122 for more details. — Charles Duffy, May 21 '22 at 20:21
...and while you're _mostly_ good about quoting expansions in this answer, that isn't true 100% of the time -- see the assignment to `target`; leaving out the quotes exposes one to https://stackoverflow.com/questions/29378566/i-just-assigned-a-variable-but-echo-variable-shows-something-else — Charles Duffy, May 21 '22 at 20:22
Ah yes, the space in `target` worked by chance because of `sed "s/[ .]/_/g"`. I've fixed it. – phili_b, May 21 '22 at 22:31
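
The points raised in these comments (read -r, clearing IFS for the read, avoiding echo for data) can be sketched on the "path:code" lines that the script's loop consumes. This is only an illustration under the assumption that the code never contains a colon; the function name split_file_code is hypothetical and not part of the answer's script.

```shell
# Hypothetical helper: read "path:code" lines on stdin and print
# "code<TAB>path" for each one.
split_file_code() {
    # IFS= keeps leading/trailing whitespace; -r keeps backslashes literal.
    while IFS= read -r line; do
        file=${line%:*}     # everything before the last colon
        code=${line##*:}    # everything after the last colon
        printf '%s\t%s\n' "$code" "$file"
    done
}
```

Parameter expansion replaces the echo | cut pairs, so a path such as 'dir\name/ File1.pdf' passes through unchanged.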
