0

I want to write a shell script that searches for a pattern in PDF files (to turn them into a kind of corpus for myself!).

I stole the following snippet from here

How to search contents of multiple pdf files?

find /path/to/folder -name '*.pdf' | xargs -P 6 -I % pdftotext % - | grep -C1 --color "pattern"

and the output looks like this

--
--
small deviation of γ from the average value  0.33 triggers
a qualitative difference in the evolution pattern, even if the

Can I make this command print the filename as well?

It doesn't have to be a "one-liner".

Thank you.

omyojj

2 Answers

1

Not much. Just split the command into a loop.

find /path/to/folder -name '*.pdf' | while IFS= read -r file
do
    # "-" sends the extracted text to stdout instead of writing a .txt file
    pdftotext "$file" - | grep -C1 --color "pattern" && echo "$file"
done

EDIT: I just noticed the example used a parallel xargs command. That can still be done: put the pdftotext and grep commands into a function and call it through xargs.

EDIT2: Only print the filename when there is a match.

It might look something like this:

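#!/bin/bash

function PDFtoText
{
    if [ "$#" -ne 1 ]
    then
        echo "Invalid number of input arguments" >&2
        return 1
    fi

    file="$1"

    # "-" sends the text to stdout; the filename is printed only on a match.
    # The if returns 0 when nothing matches, so a miss is not treated as a failure.
    if pdftotext "$file" - | grep -C1 --color "pattern"
    then
        echo "$file"
    fi
}
export -f PDFtoText

# -print0 with -0 keeps filenames containing spaces intact.
# -I already runs one command per file, so -n1 is not needed.
find /path/to/folder -name '*.pdf' -print0 |
    xargs -0 -P 6 -I '{}' bash -c 'PDFtoText "$@" || exit 255' arg0 '{}'

if [[ $? -ne 0 ]]
then
    exit 1
fi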
nln
  • Tried this and it prints out all the filenames. Can I print only the files that have at least one match? – omyojj Apr 23 '15 at 15:45
  • Yeah, that's a good point. A simple && command after the grep does that trick. Edited. – nln Apr 23 '15 at 15:54
0

Why not use something like

find /path/to/folder/ -type f -name '*.pdf' -print0 | \
  xargs -0 -I{} \
  sh -c 'echo "===== file: $1"; pdftotext "$1" - | grep -C1 --color "pattern"' sh {}

This always prints the filename, even for files without a match. Is that an acceptable compromise? If not, the echo can be moved after the grep and chained with &&, as suggested in the other answer.
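Another way to get the filename only where something matched is to let grep print it itself. The sketch below assumes GNU grep, whose --label option sets the name shown for standard input, and pdftotext from poppler-utils; grep_pdf is a hypothetical helper name used only for illustration.

```shell
# Sketch: prefix every reported line with the name of the PDF it came from.
# Assumptions: GNU grep (for --label) and pdftotext are installed;
# grep_pdf is a made-up helper name, not part of either tool.
grep_pdf() {
    pattern="$1"
    file="$2"
    # -H forces a filename prefix; --label supplies the name to use for stdin
    pdftotext "$file" - | grep -H --label="$file" -C1 -- "$pattern"
}
```

Called as grep_pdf "pattern" "/path/to/folder/some file.pdf", matching lines should come out as "filename:line" and context lines as "filename-line", so nothing is printed for files without a match.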

I prefer to use -print0 in combination with -0 just to deal with filenames with spaces.

I'd remove the -P 6 option, because the output of the six parallel processes could get interleaved.
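If the parallelism is worth keeping, one way to reduce the mixing is to collect each file's result first and emit it in a single printf. This is only a sketch: search_one is a hypothetical helper, and writing everything at once makes interleaving less likely but does not guarantee it, especially for large outputs.

```shell
# Sketch: buffer one file's matches, then print header and matches together.
# Assumptions: pdftotext (poppler-utils) and grep are installed;
# search_one is a made-up helper name.
search_one() {
    file="$1"
    # Capture the whole result; leave quietly if grep found nothing
    matches=$(pdftotext "$file" - | grep -C1 "pattern") || return 0
    # One printf call keeps the header next to its matches
    printf '===== file: %s\n%s\n' "$file" "$matches"
}
```

In bash it could then be exported with export -f search_one and run as find /path/to/folder/ -type f -name '*.pdf' -print0 | xargs -0 -P 6 -I{} bash -c 'search_one "$1"' bash {}.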

MaxChinni