Regex for capturing smallest group

Question

I am trying to capture the ID of a PDF Page object that looks like this:

4 0 obj
<<
/Type /Page /
...
>>
endobj

The ID is the number at the start of the 'ID 0 obj' line (4 in this example). The problem is that my file has multiple objects, so the following pattern captures everything from the first object declaration to the first instance of a Page object:

preg_match_all("/([0-9]+) 0 obj.+?\/Page[ \n]*?\//s", $input_lines, $output_array);

Here is a sample of my file if you want to try it out. You will see that there are multiple objects that include the word 'Page':

%PDF-1.3
%¦¦¦¦

1 0 obj
<<
/Type /Catalog /AcroForm << /Fields [12 0 R 13 0 R] /NeedAppearances false  /SigFlags 3 /Version /1.7 /Pages 3 0 R /Names << >> /ViewerPreferences << /Direction /L2R >> /PageLayout /SinglePage /PageMode /UseNone /OpenAction [0 0 R /FitH null] /DR << /Font << /F1 14 0 R >> >> /DA (/F1 0 Tf 0 g) /Q 0 >> /Perms << /DocMDP 11 0 R >>
/Outlines 2 0 R
/Pages 3 0 R
>>
endobj

2 0 obj
<<
/Type /Outlines
/Count 0
>>
endobj

3 0 obj
<<
/Type /Pages
/Count 2
/Kids [ 4 0 R 6 0 R ]
>>
endobj

4 0 obj
<<
/Type /Page
/Parent 3 0 R
/Resources <<
/Font <<
/F1 9 0 R
>>
/ProcSet 8 0 R
>>
/MediaBox [0 0 612.0000 792.0000]
/Contents 5 0 R
>>
endobj

5 0 obj
<< /Length 1074 >>
stream
2 J
BT
0 0 0 rg
/F1 0027 Tf
57.3750 722.2800 Td
( A Simple PDF File ) Tj
ET
BT
/F1 0010 Tf

What should I change to make it non-greedy?

EDIT: Clarifications

I forgot to mention that I need to capture all of the Page object IDs.
Some people suggested a more specific regex, but the sample above is not a formal specification of how objects are built; the layout below is also possible. Spaces are not mandatory, and there can be other tags before the '/Type /Page' tag.

Example :

4 0 obj
<< /UselessTag/Type/Page/
...
>>
endobj

There are also tags called Pages, PageLayout and SinglePage, and I don't want to capture them.

Are you after 1, 2 or 3 matches here? There are several "records" with `/Pages`. — Wiktor Stribiżew, Jul 12 '17 at 13:26
I am not sure now: what is the marker string for a Page entry? `/Pages`, `/Page` or both? Can we assume that a *whole word `Page` is a marker*? Or is it `/Page` or `Page/`? — Wiktor Stribiżew, Jul 12 '17 at 13:47
I edited my question to clarify everything that was asked in the comments. — Shashimee, Jul 12 '17 at 13:51
Ok, I think I got it. The marker is `/Type` followed by 0+ spaces, then `/Page` followed with whitespace. — Wiktor Stribiżew, Jul 12 '17 at 13:55
Yes, and there can be line breaks after **/Type** haha. This regex is exhausting me... — Shashimee, Jul 12 '17 at 13:57
It is OK, see my answer. Its performance might be improved; let me know via a comment if it is too slow for you. — Wiktor Stribiżew, Jul 12 '17 at 14:04

Wiktor Stribiżew · Accepted Answer · 2017-09-25T06:12:39.520

You may use

'~^(\d+) 0 obj(?:(?!^\d+ 0 obj$).)*?\/Type\s*\/Page\s.*?endobj$~sm'

See the regex demo

Details:

^ - start of a line anchor (as m modifier makes ^ match start of a line and not of a whole string)
(\d+) 0 obj - 1 or more digits (captured into Group 1), then space, 0, space and an obj substring
(?:(?!^\d+ 0 obj$).)*? - a tempered greedy token that matches any char (.) that does not start a ^\d+ 0 obj$ pattern, as few times as possible
\/Type\s*\/Page\s - /Type, 0+ whitespaces (replace \s with \h to only match horizontal whitespace), /Page and then a whitespace
.*? - any 0+ chars as few as possible up to the first occurrence of
endobj - endobj followed with...
$ - the end of line position.
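
To see the pattern in action, here is a minimal PHP sketch. The $pdf sample is a shortened, hypothetical stand-in for the file in the question, and $pageIds is only an illustrative variable name. It assumes the PDF text is uncompressed and uses \n line endings.

```php
<?php
// Shortened, hypothetical stand-in for the PDF text from the question.
$pdf = <<<'PDF'
1 0 obj
<<
/Type /Catalog /PageLayout /SinglePage
/Pages 3 0 R
>>
endobj

3 0 obj
<<
/Type /Pages
/Kids [ 4 0 R 6 0 R ]
>>
endobj

4 0 obj
<<
/Type /Page
/Parent 3 0 R
>>
endobj

6 0 obj
<< /UselessTag/Type/Page
/Parent 3 0 R
>>
endobj
PDF;

// The pattern from this answer: s lets . cross line breaks,
// m makes ^ and $ match at line boundaries.
$pattern = '~^(\d+) 0 obj(?:(?!^\d+ 0 obj$).)*?\/Type\s*\/Page\s.*?endobj$~sm';

preg_match_all($pattern, $pdf, $matches);

// Group 1 holds the object number of every matched Page object.
$pageIds = $matches[1];
print_r($pageIds); // expected: 4 and 6
```

Objects 1 and 3 are skipped because the tempered token refuses to run past the next "N 0 obj" line, so /PageLayout, /SinglePage and /Pages never count as a hit. Note that, as written, the pattern expects whitespace after /Page.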

score 1 · Answer 2 · answered Jul 12 '17 at 14:00

I wouldn't use regular expressions on PDF. There are several conditions where this approach will fail:

The page object is inside an object stream, and therefore packed, most probably with the Deflate algorithm (this is allowed from PDF version 1.5 on).
Incremental updates inside the PDF document can lead to double hits on the same page.
The marker /Page is not inside the dictionary that you want to match, but inside an indirect object (never seen, but theoretically possible). E.g. you have:

5 0 obj
<< /Type 6 0 R ....>>
endobj     
6 0 obj
/Page
endobj

Note: You also cannot expect the pages to be written in the PDF document in the same order in which you see them in the viewer.

But if you really must do it that way, I would first match the PDF object with

/([0-9]+) 0 obj(.+?)endobj/s

and would search in the second matched string for

/\/Type\s*\/Page[\s>]/

Allowing > as an alternative to whitespace at the end is important, because you also need to be able to match "/Type/Page>>", where /Type/Page is the last entry in the PDF dictionary.
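
The two steps above could be combined roughly as follows. This is only a sketch: findPageObjectIds and $sample are hypothetical names and data made up for illustration, and it assumes the objects are stored as plain, uncompressed text.

```php
<?php
// Hypothetical helper: returns the object numbers of all page dictionaries
// found in raw (uncompressed) PDF text.
function findPageObjectIds(string $pdf): array
{
    $ids = [];

    // Step 1: cut the text into "N 0 obj ... endobj" blocks.
    // The s modifier lets . cross line breaks.
    preg_match_all('/([0-9]+) 0 obj(.+?)endobj/s', $pdf, $objects, PREG_SET_ORDER);

    foreach ($objects as $object) {
        // Step 2: keep the block only if its body contains /Type /Page
        // followed by whitespace or by the closing >.
        if (preg_match('/\/Type\s*\/Page[\s>]/', $object[2])) {
            $ids[] = (int) $object[1];
        }
    }

    return $ids;
}

// Made-up sample: one /Pages node and two page objects.
$sample = "3 0 obj\n<< /Type /Pages /Kids [ 4 0 R 6 0 R ] >>\nendobj\n"
        . "4 0 obj\n<<\n/Type /Page\n/Parent 3 0 R\n>>\nendobj\n"
        . "6 0 obj\n<</Parent 3 0 R/Type/Page>>\nendobj\n";

print_r(findPageObjectIds($sample)); // expected: 4 and 6
```

Because each object body is tested on its own, a hit can never span two objects, which is the failure described in the question. The caveats listed above (object streams, incremental updates, indirect /Type values) still apply.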

Bernhard · Answer 3 · 2017-07-12T13:50:18.203

You can make a specific quantifier ungreedy (lazy) by putting a question mark after it:

Example:

 \(.*\)

Matches (greedy, a single span from the first ( to the last )):

(test)test(test)test(test)

in the input: test (test)test(test)test(test) test

Example:

 \(.*?\)

Matches (lazy, each parenthesised group separately):

(test)

three times in the input: test (test) test(test)test(test)test
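
The difference can be checked with a few lines of PHP. This is a small sketch; the variable names are arbitrary and the input string is the one from the first example.

```php
<?php
$text = 'test (test)test(test)test(test) test';

// Greedy: .* runs to the end of the string, then backtracks to the last ")".
preg_match_all('/\(.*\)/', $text, $greedy);

// Lazy: .*? stops at the first ")" it can reach.
preg_match_all('/\(.*?\)/', $text, $lazy);

print_r($greedy[0]); // one match: (test)test(test)test(test)
print_r($lazy[0]);   // three matches: (test), (test), (test)
```

A lazy quantifier only shortens the right-hand end of a match. The match still begins at the leftmost position where the pattern can succeed, which is why the pattern in the question still starts at the first "N 0 obj" in the file.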

edited Jul 12 '17 at 13:50

answered Jul 12 '17 at 13:23

You are wrong, `U` modifier does not make a regex ungreedy. It *swaps* greediness of the quantifiers used in the pattern. – Wiktor Stribiżew Jul 12 '17 at 13:26
If you test your solution, you will see [it matches too much](https://regex101.com/r/Bxld3n/1). – Wiktor Stribiżew Jul 12 '17 at 13:48

score 0 · Answer 4 · answered Jul 12 '17 at 13:30

Try a more specific regex so that it does not match unneeded parts of the text.

preg_match_all("/([0-9]+?) 0 obj\n\<\<\n\/Type\s\/Page[ \n]*?\//s", $input_lines, $output_array);

Proof: https://regex101.com/r/HjyQpS/1

answered Jul 12 '17 at 13:30

Māris Kiseļovs

score 0 · Answer 5 · answered Jul 12 '17 at 13:30

This should work:

(\d+) 0 obj[^>]+/Page$

Regex101 demo

answered Jul 12 '17 at 13:30

Oleksii Filonenko

score 0 · Answer 6 · answered Jul 12 '17 at 14:17

Use this regular expression:

/\d+\s0\sobj.+endobj/smU

Note that the U modifier inverts the greediness of the quantifiers, so the .+ here matches as little as possible. See the matching example here: https://www.tinywebhut.com/regex/8
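
As a comment on another answer in this thread points out, U swaps the greediness of every quantifier in the pattern rather than simply switching greediness off. A short PHP sketch with a made-up input string shows the effect:

```php
<?php
$text = '<a><b>';

// No modifier: .+ is greedy and runs to the last ">".
preg_match('/<.+>/', $text, $default);

// U modifier: the same .+ now behaves lazily.
preg_match('/<.+>/U', $text, $swapped);

// U modifier plus ?: the quantifier is swapped back to greedy.
preg_match('/<.+?>/U', $text, $swappedBack);

echo $default[0], "\n";     // <a><b>
echo $swapped[0], "\n";     // <a>
echo $swappedBack[0], "\n"; // <a><b>
```

So under U, adding ? to a quantifier makes it greedy again, which is easy to overlook when mixing both styles in one pattern.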

answered Jul 12 '17 at 14:17

Saral
