
I'm trying to search PDF files from the terminal by passing a search string on the command line. The search string can be a single word, multiple words combined with AND/OR, or an exact phrase, and I would like to keep a single parameter for all of these query types. I'll save the command below as a shell script and call it through an alias defined in .aliases, from zsh or bash.

This follows on from sjr's answer to "search multiple pdf files", which I've adapted like this:

find ${1} -name '*.pdf' -exec sh -c 'pdftotext "{}" - |
      grep -E -m'${2}' --line-buffered --label="{}" '"${3}"' '${4}'' \;

$1 takes path

$2 limits the number of results

$3 is the context parameter (it accepts -A, -B, or -C, individually or combined)

$4 takes search string

The issue I'm facing is with the value of $4. As I said earlier, I want this parameter to carry my search string, which can be a phrase, a single word, or multiple words in an AND/OR relation.

I am not getting the desired results. Phrase searches returned nothing until I followed Robin Green's comment, and even now the phrase results are not accurate.

Edit: Text from judgments:

The original rule was that you could not claim for psychiatric injury in 
negligence. There was no liability for psychiatric injury unless there was also 
physical injury (Victorian Rly Commrs v Coultas [1888]). The courts were worried 
both about fraudulent claims and that if they allowed claims, the floodgates would 
open. 

The claimant was 15 metres away behind a tram and did not see the accident but 
later saw blood on the road. She suffered nervous shock and had a miscarriage. She 
sued for negligence. The court held that it was not reasonably foreseeable that 
someone so far away would suffer shock and no duty of care was owed.

White v Chief Constable of South Yorkshire [1998] The claimants were police
officers who all had some part in helping victims at Hillsborough and suffered 
psychiatric injury. The House of Lords held that rescuers did not have a special 
position and had to follow the normal rules for primary and secondary victims. 
They were not in physical danger and not therefore primary victims. Neither could 
they establish they had a close relationship with the injured so failed as 
secondary victims. It is necessary to define `nervous shock' which is the rather 
quaint term still sometimes used by lawyers for various kinds of 
psychiatric injury...rest of para

word1 can be: shock, (nervous shock)

word2 can be: psychiatric

exact phrase: (nervous shock)

Commands

alias s='sh /path/shell/script.sh'
export p='path/pdf/files'

In terminal:

s "$p" 10 -5 "word1/|word2"          #for OR search
s "$p" 10 -5 "word1.*word2.*word3"   #for AND search
s "$p" 10 -5  ""exact phrase""       #for phrase search

Second test sample: an example PDF file, since the command runs on PDF documents: Test-File. It's 4 pages (part of a 361-page file).

If we run the following command on it, as the solution mentions:

s "$p" 10 -5 'doctrine of basic structure' > ~/desktop/BSD.txt && open ~/desktop/BSD.txt

we'll get the relevant text and avoid going through the entire file. I thought it would be a cool way to read just what we want rather than taking the traditional approach.

lawsome
  • Why the downvote? I want to know so that I can take care in future when asking questions. – lawsome Jan 23 '17 at 22:39
  • Single quotes will cause the quoted parameters not to be expanded (assuming you're using bash or sh), which is not what you want. You should use double quotes to quote parameters in bash or sh. Or are you using some other shell? – Robin Green Jan 23 '17 at 22:50
  • I didn't down-vote, and I too wish people would leave feedback when they do. That said, it's always worth reducing your question to an [MCVE (Minimal, Complete, and Verifiable Example)](http://stackoverflow.com/help/mcve). General tips on asking a question can be found [here](http://stackoverflow.com/help/how-to-ask). – mklement0 Jan 23 '17 at 22:56
  • @RobinGreen, thanks, I'm using zsh, haven't tried with bash. Will try further. mklement0 thanks for the links, i am reading them. – lawsome Jan 23 '17 at 23:01
  • So far you've shown samples of your code but no sample input/output. [edit] your question to include concise, testable sample input and expected output. – Ed Morton Jan 24 '17 at 03:21
  • @EdMorton , sample input here is court judgments, i was trying to skim them for law on "psychiatric injury", "nervous shock" etc. instead of reading them entirely. actually I'm quite surprised that I've made it so far. – lawsome Jan 24 '17 at 07:55
  • Your AND is not a real AND because you specify the order of the words; you need several greps for an AND unless you provide every combination. If that's not possible, maybe use sed or awk to do the job, if available. – NeronLeVelu Jan 24 '17 at 14:59
  • Don't tell us the background of your input, **show** us some concise, testable sample input and expected output. – Ed Morton Jan 24 '17 at 15:21
  • @Ed , just give me a min. I'll share the dropbox link of my judgments for you. – lawsome Jan 24 '17 at 16:14
  • No, don't do that, no one's going to want to wade through more sample input than you can include in your question. Simply [edit] your question to include concise, testable sample input that accurately represents a useful sample of your real data and the expected output given that input. As mentioned above, we need a [mcve] (emphasis on **Minimal**) to test a potential solution against, so that we can best help you. – Ed Morton Jan 24 '17 at 16:16
  • @Ed Please ignore this as a rookie mistake. The PDF text had each para as a single line. I hope it's OK now. – lawsome Jan 26 '17 at 01:01

1 Answer


You need to:

  • pass a double-quoted command string to sh -c in order for the embedded shell-variable references to be expanded (which then requires escaping embedded " instances as \").

  • quote the regex with printf %q for safe inclusion in the command string - note that this requires bash, ksh, or zsh as the shell.

dir=$1
numMatches=$2
context=$3
regexQuoted=$(printf %q "$4")

find "${dir}" -type f -name '*.pdf' -exec sh -c "pdftotext \"{}\" - |
  grep -E -m${numMatches} --with-filename --label=\"{}\" ${context} ${regexQuoted}" \;
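
As a minimal sketch of why the quoting style of the command string matters (the variable and function names here are hypothetical, not from the question): with a single-quoted string the calling shell passes the text through untouched, so the inner sh looks up its own, unset variable, whereas a double-quoted string is expanded before sh ever runs.

```shell
# Hypothetical variable; deliberately not exported, so a child sh cannot see it.
searchRegex='nervous shock'

# Single quotes: $searchRegex reaches the inner sh unexpanded and is empty there.
single_quoted() { sh -c 'printf "[%s]\n" "$searchRegex"'; }

# Double quotes: the calling shell substitutes the value into the command string.
double_quoted() { sh -c "printf '[%s]\n' '$searchRegex'"; }

single_quoted   # prints []
double_quoted   # prints [nervous shock]
```

The hand-written single quotes in the second function break as soon as the value itself contains a single quote, which is the gap that printf %q is meant to close.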

The 3 invocation scenarios would then be:

s "$p" 10 -5 'word1|word2'          #for OR search
s "$p" 10 -5 'word1.*word2.*word3'  #for AND search
s "$p" 10 -5 'exact phrase'         #for phrase search

Note that there's no need to escape | and no need to add an extra layer of double quotes around exact phrase.

Also note that I've replaced --line-buffered with --with-filename, as I assume that's what you meant (to have the matching lines prefixed with the PDF file path).
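
To see how the three pattern styles behave under grep -E, here is a small sketch run against a few illustrative lines loosely based on the sample text in the question (the lines and the function name sample are made up for the demonstration). It also shows an order-independent AND built by chaining one grep per word, as NeronLeVelu's comment suggests; that variant is an addition and not part of the script above.

```shell
# Illustrative input; only loosely based on the judgment text in the question.
sample() {
  printf '%s\n' \
    'She suffered nervous shock and had a miscarriage.' \
    'officers suffered psychiatric injury after shock' \
    'There was no liability for psychiatric injury' \
    'The court held that no duty of care was owed.'
}

# OR: lines containing either word (matches the first three lines).
sample | grep -E 'shock|psychiatric'

# Ordered AND: both words on one line, in this order only.
# The reverse pattern 'shock.*psychiatric' would match nothing here.
sample | grep -E 'psychiatric.*shock'

# Order-independent AND: one grep per word, chained.
sample | grep -E 'shock' | grep -E 'psychiatric'

# Exact phrase: the space is just another literal character.
sample | grep -E 'nervous shock'
```

Keep in mind that all of these match within a single output line of pdftotext, so two words that fall on different lines of the extracted text will not satisfy an AND pattern.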


Note that with the above approach a shell instance must be created for every input path, which is inefficient, so consider rewriting your command as follows, which also obviates the need for printf %q (assume regex=$4):

find "${dir}" -type f -name '*.pdf' | 
  while IFS= read -r f; do
    pdftotext "$f" - |
      grep -E -m${numMatches} --with-filename --label="$f" ${context} "${regex}"
  done

The above assumes that your filenames have no embedded newlines, which is rarely a real-world concern. If it is, there are ways to solve the problem.

An additional advantage of this solution is that it uses only POSIX-compliant shell features, but note that the grep command uses nonstandard options.
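
If filenames with embedded newlines do matter, one possible approach is to let find hand the paths directly to a single sh via -exec ... {} +, so that paths are never parsed from a text stream. The following is only a sketch: the function name search_pdfs is hypothetical, and it assumes pdftotext and a grep that supports the nonstandard options used above.

```shell
# Hypothetical wrapper: search_pdfs DIR MAX_MATCHES CONTEXT_OPTS REGEX
search_pdfs() {
  dir=$1 numMatches=$2 context=$3 regex=$4
  # The settings are passed as positional arguments, so nothing has to be
  # spliced into the command string; find appends the PDF paths after them.
  find "$dir" -type f -name '*.pdf' -exec sh -c '
    numMatches=$1 context=$2 regex=$3
    shift 3
    for f do
      # $context is left unquoted on purpose so that a value such as
      # "-A2 -B1" splits into two separate grep options.
      pdftotext "$f" - |
        grep -E -m"$numMatches" --with-filename --label="$f" $context -e "$regex"
    done
  ' sh "$numMatches" "$context" "$regex" {} +
}
```

It would be called the same way as before, for example search_pdfs "$p" 10 -5 'nervous shock'. Because the paths are batched, far fewer sh instances are started than with one -exec sh -c per file.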

mklement0