Regex ignoring linebreaks and "page layout"

Question

I have an assortment of searchable PDF files and I often search for particular patterns in all of them simultaneously, using the pdfgrep command. My regex knowledge is somewhat limited and I'm not sure how to work around linebreaks and page layout.

For example, I would like to find the pattern "ignor.{0,10}layout" in each example below:

This is a rather difficult     You see, I would like to ignore
task that I am trying to       page layout and still find the
achieve.                       pattern I am looking for.

This is a rather difficult     This is because I would like to ig-
task that I am trying to       nore page layout and still find the
achieve.                       pattern I am looking for.

In both examples, I would like the first two lines to be reported by

pdfgrep -n "ignor.{0,10}layout" *

but it fails to do so because:

there is a linebreak in the middle.
in the first example, there are more than 10 characters between ignor and layout.
in the second example, ignor is cut in half.

Is there a regex that would solve this problem entirely?

The lines on the left side are definitely part of my problem, if that was your question. — Pippin, Mar 16 '19 at 19:16

score 1 · Accepted Answer · answered Mar 16 '19 at 19:26

pdfgrep does not have grep's -z flag, which makes grep treat zero bytes rather than newlines as line terminators, so that a pattern can match across line breaks. As a workaround, you can use pdftotext to convert the PDF to text, write the result to STDOUT, and pipe it into a regular grep call:

pdftotext SPECIFIC-FILE.pdf - | grep -Pzo "(?s)YOUR\s+QUERY"

pdftotext takes a single input file, so you cannot pass it a glob directly, but you can at least loop over the glob:

for pdf in *.pdf; do echo -n "$pdf:"; pdftotext "$pdf" - | grep -Pzo "(?s)YOUR\s+QUERY"; done

Please note that if you want to match whitespace, you will almost always want to use \s+, which also matches the newlines that remain inside the text when -z is enabled. See this other answer for an explanation of the flags.
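
To see what the flags do without a PDF at hand, here is a minimal sketch that feeds grep a few lines of plain text standing in for pdftotext output. The sample text and the variable names are made up for illustration, and GNU grep built with -P support is assumed:

```shell
# Sample text standing in for the output of pdftotext (hypothetical content).
sample='You see, I would like to ignore
page layout and still find the
pattern I am looking for.'

# -z: the whole input is one record, so the pattern can cross newlines
# -P: PCRE syntax, needed for (?s) and \s
# -o: print only the matched part
# With -z each match is terminated by a NUL byte; tr turns it back into a newline.
match=$(printf '%s\n' "$sample" | grep -Pzo '(?s)ignor.{0,10}layout' | tr '\0' '\n')
printf '%s\n' "$match"
```

The match spans the line break because (?s) lets the dot match a newline. Without -z, grep would test each line separately and find nothing.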

Felix

1,837
9
26

`-z` would translate `"word1\nword2"` into `"word1word2"` instead of `"word1 word2"`, would it not? Also I believe that this solution doesn't spot `ignor` in the second example? – Pippin Mar 16 '19 at 19:41
I think it's almost exactly what I need though, I'm trying to work on what you submitted. All I need is to replace all `-` by nothing and all linebreaks by spaces. And maybe show a little more than just the pattern because the `.txt` file has only 1 line, so the only way to locate the pattern is to know the characters before and the characters after. – Pippin Mar 16 '19 at 19:56
`for pdf in *.pdf; do echo -n "$pdf:"; pdftotext "$pdf" - | sed -z 's/-//g;s/\n//g' | grep -Po ".{0,20}ignor.{0,10}layout.{0,20}"; echo ""; done` solved my problem, thank you very much! – Pippin Mar 16 '19 at 20:43
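
A slightly more conservative variant of the normalization in the last comment, as a sketch only: it removes a hyphen only when it ends a line, and turns the remaining newlines into spaces instead of deleting them. The helper name normalize_text is hypothetical, and GNU sed is assumed because -z is a GNU extension:

```shell
# Hypothetical helper: reads text on stdin, writes it back as one long line.
normalize_text() {
    # 1. drop a hyphen that ends a line, together with the newline (ig-\nnore -> ignore)
    # 2. turn every remaining newline into a space
    sed -z 's/-\n//g; s/\n/ /g'
}

# Demo on plain text standing in for pdftotext output.
printf 'This is because I would like to ig-\nnore page layout and still find the\npattern I am looking for.\n' |
    normalize_text |
    grep -Po '.{0,20}ignor.{0,10}layout.{0,20}'

# With real files (GNU grep's --label names stdin, -H prints that name):
#   for pdf in *.pdf; do
#       pdftotext "$pdf" - | normalize_text |
#           grep -HPo --label="$pdf" '.{0,20}ignor.{0,10}layout.{0,20}'
#   done
```

One caveat: a compound word that is genuinely hyphenated and happens to break at the end of a line (for example "well-" followed by "known") is also joined without its hyphen.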
