I am trying to capture the ID of a PDF Page object that looks like this:

4 0 obj
<<
/Type /Page /
...
>>
endobj

The ID is the number at the start of 'ID 0 obj'. The problem is that my file has multiple objects, so the following pattern captures everything from the first object declaration to the first instance of a Page object:

preg_match_all("/([0-9]+) 0 obj.+?\/Page[ \n]*?\//s", $input_lines, $output_array);

Here is a sample of my file if you want to try it out. You will see that there are multiple objects that include the word 'Page':

%PDF-1.3
%¦¦¦¦

1 0 obj
<<
/Type /Catalog /AcroForm << /Fields [12 0 R 13 0 R] /NeedAppearances false  /SigFlags 3 /Version /1.7 /Pages 3 0 R /Names << >> /ViewerPreferences << /Direction /L2R >> /PageLayout /SinglePage /PageMode /UseNone /OpenAction [0 0 R /FitH null] /DR << /Font << /F1 14 0 R >> >> /DA (/F1 0 Tf 0 g) /Q 0 >> /Perms << /DocMDP 11 0 R >>
/Outlines 2 0 R
/Pages 3 0 R
>>
endobj

2 0 obj
<<
/Type /Outlines
/Count 0
>>
endobj

3 0 obj
<<
/Type /Pages
/Count 2
/Kids [ 4 0 R 6 0 R ]
>>
endobj

4 0 obj
<<
/Type /Page
/Parent 3 0 R
/Resources <<
/Font <<
/F1 9 0 R
>>
/ProcSet 8 0 R
>>
/MediaBox [0 0 612.0000 792.0000]
/Contents 5 0 R
>>
endobj

5 0 obj
<< /Length 1074 >>
stream
2 J
BT
0 0 0 rg
/F1 0027 Tf
57.3750 722.2800 Td
( A Simple PDF File ) Tj
ET
BT
/F1 0010 Tf

What should I change to make it not greedy?

EDIT: Clarifications

  • I forgot to mention that I need to capture all of the Page object IDs.
  • Some people told me to use a more specific regex, but the sample above is not the only way objects can be built; the form below is also possible. You can see that the spaces are not mandatory and that there can be other tags before the '/Type /Page' tag.

Example :

4 0 obj
<< /UselessTag/Type/Page/
...
>>
endobj
  • There are tags called Pages, PageLayout and SinglePage, and I don't want to capture them.
Shashimee

6 Answers

You may use

'~^(\d+) 0 obj(?:(?!^\d+ 0 obj$).)*?\/Type\s*\/Page\s.*?endobj$~sm'

See the regex demo

Details:

  • ^ - start of a line anchor (as m modifier makes ^ match start of a line and not of a whole string)
  • (\d+) 0 obj - 1 or more digits (captured into Group 1), then space, 0, space and an obj substring
  • (?:(?!^\d+ 0 obj$).)*? - a tempered greedy token that matches any char (.) that does not start a ^\d+ 0 obj$ pattern, as few times as possible
  • \/Type\s*\/Page\s - /Type, 0+ whitespaces (replace \s with \h to only match horizontal whitespace), /Page and then a whitespace
  • .*? - any 0+ chars as few as possible up to the first occurrence of
  • endobj - endobj followed with...
  • $ - the end of line position.
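
As a minimal sketch of how this pattern might be used from PHP, the snippet below runs it over a shortened, hypothetical fragment modeled on the question's sample. The variable names ($pdf, $pattern, $pageIds) are assumptions for illustration only.

```php
<?php
// Hypothetical, shortened PDF fragment modeled on the question's sample.
$pdf = <<<'EOT'
3 0 obj
<<
/Type /Pages
/Count 2
/Kids [ 4 0 R 6 0 R ]
>>
endobj

4 0 obj
<<
/Type /Page
/Parent 3 0 R
>>
endobj

6 0 obj
<<
/Type /Page
/Parent 3 0 R
>>
endobj
EOT;

// The pattern from the answer: s lets . match newlines, m makes ^ and $ line anchors.
$pattern = '~^(\d+) 0 obj(?:(?!^\d+ 0 obj$).)*?\/Type\s*\/Page\s.*?endobj$~sm';

preg_match_all($pattern, $pdf, $matches);

// Group 1 holds the object IDs of the matched Page objects.
$pageIds = $matches[1];
print_r($pageIds); // object 3 (/Pages) is skipped, IDs 4 and 6 are captured
```

Note that the pattern requires whitespace after /Page, so a compact form such as /Type/Page/ would not be matched as written.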
Wiktor Stribiżew

I wouldn't work with regular expressions on PDF. There are several conditions under which this approach will fail.

  1. The page object is inside an object stream (and therefore packed, most likely with a Deflate algorithm). This is allowed in PDF version 1.5 and up.
  2. Incremental updates inside the PDF document can lead to duplicate hits for the same page.
  3. The /Page marker is not inside the dictionary you want to match, but inside an indirect object (I have never seen this, but it is theoretically possible). E.g. you have:
5 0 obj
<< /Type 6 0 R ....>>
endobj     
6 0 obj
/Page
endobj

Note: You also cannot expect the pages to be written in the PDF document in the same order in which you see them in the viewer.

But if you really must do it that way, I would first match the PDF object with

/([0-9]+) 0 obj(.+?)endobj/s

and then search the second captured group for

/\/Type\s*\/Page[\s>]/

Allowing > as well as whitespace after /Page is important, because you also need to match "/Type/Page>>", where /Type/Page is the last entry in the PDF dictionary.
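
The two steps described above might be sketched in PHP as follows. This is only an illustration on a hypothetical fragment; the variable names are assumptions, and the second pattern additionally accepts a / after /Page (an assumption added here to cover the question's compact /Type/Page/ form).

```php
<?php
// Hypothetical fragment: a /Pages node, a compact Page, and a Page whose
// /Type/Page entry is the last one in the dictionary.
$pdf = <<<'EOT'
3 0 obj
<< /Type /Pages /Count 2 /Kids [ 4 0 R 6 0 R ] >>
endobj
4 0 obj
<< /UselessTag/Type/Page/Parent 3 0 R >>
endobj
6 0 obj
<</Parent 3 0 R/Type/Page>>
endobj
EOT;

// Step 1: capture every object as (ID, body). The s modifier lets . span lines.
preg_match_all('/([0-9]+) 0 obj(.+?)endobj/s', $pdf, $objects, PREG_SET_ORDER);

$pageIds = [];
foreach ($objects as $object) {
    // Step 2: keep the ID only if the body declares /Type /Page.
    // The trailing class rejects /Pages; the \/ in it is an added assumption.
    if (preg_match('/\/Type\s*\/Page[\s>\/]/', $object[2])) {
        $pageIds[] = $object[1];
    }
}

print_r($pageIds); // IDs 4 and 6
```

Splitting the work this way keeps each pattern simple, but it still shares the limitations listed above (object streams, incremental updates, indirect /Type values).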

PatrickF

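You can make a specific quantifier non-greedy by appending a question mark to it.

Example (greedy):

 \(.*\)

Matches a single span, from the first "(" to the last ")":

test (test)test(test)test(test) test

Example (non-greedy):

 \(.*?\)

Matches each parenthesized group separately, starting with the first:

test (test) test(test)test(test)test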
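
A small PHP sketch of the difference, assuming the sample string is test(test)test(test)test(test)test (the variable names are illustrative):

```php
<?php
$text = 'test(test)test(test)test(test)test';

// Greedy: .* runs to the end of the string, then backtracks to the last ")".
preg_match_all('/\(.*\)/', $text, $greedy);

// Non-greedy: .*? stops at the first ")" it can reach.
preg_match_all('/\(.*?\)/', $text, $lazy);

print_r($greedy[0]); // one match: "(test)test(test)test(test)"
print_r($lazy[0]);   // three matches: "(test)", "(test)", "(test)"
```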

Bernhard

Try a more specific regex so that it does not match unneeded parts of the text.

preg_match_all("/([0-9]+?) 0 obj\n\<\<\n\/Type\s\/Page[ \n]*?\//s", $input_lines, $output_array);

Proof: https://regex101.com/r/HjyQpS/1
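
To illustrate what this stricter pattern does and does not cover, here is a minimal PHP sketch on a hypothetical fragment. The pattern is written in a single-quoted string so that PCRE itself interprets \n; the variable names are assumptions.

```php
<?php
// Hypothetical fragment: a /Pages node, a Page in the "one tag per line"
// layout, and a Page in the compact layout from the question's edit.
$pdf = <<<'EOT'
3 0 obj
<<
/Type /Pages
/Count 2
>>
endobj

4 0 obj
<<
/Type /Page
/Parent 3 0 R
>>
endobj

6 0 obj
<< /UselessTag/Type/Page/Parent 3 0 R >>
endobj
EOT;

$pattern = '/([0-9]+?) 0 obj\n\<\<\n\/Type\s\/Page[ \n]*?\//s';
preg_match_all($pattern, $pdf, $matches);

// Only the object laid out exactly as "N 0 obj", "<<", "/Type /Page" matches.
print_r($matches[1]); // ID 4 only
```

The compact object 6 is missed because the pattern insists on a newline after "0 obj" and after "<<", and on /Type being the first tag.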

Māris Kiseļovs

This should work:

(\d+) 0 obj[^>]+/Page$

Regex101 demo

Oleksii Filonenko

Use this regular expression:

/\d+\s0\sobj.+endobj/smU

Note that the U modifier inverts the greediness of the quantifiers, so .+ becomes non-greedy. See the matching example here: https://www.tinywebhut.com/regex/8
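
As a rough PHP illustration of the U modifier, the sketch below compares the pattern with and without it on a hypothetical two-object fragment (variable names are assumptions):

```php
<?php
// Hypothetical fragment with two objects.
$pdf = <<<'EOT'
2 0 obj
<< /Type /Outlines /Count 0 >>
endobj
4 0 obj
<< /Type /Page /Parent 3 0 R >>
endobj
EOT;

// Without U, .+ is greedy and runs to the last "endobj".
preg_match_all('/\d+\s0\sobj.+endobj/s', $pdf, $greedy);

// With U, .+ is non-greedy and stops at the first "endobj".
preg_match_all('/\d+\s0\sobj.+endobj/smU', $pdf, $ungreedy);

// Equivalent here: an explicit lazy quantifier instead of the U modifier.
preg_match_all('/\d+\s0\sobj.+?endobj/s', $pdf, $lazy);

echo count($greedy[0]), "\n";        // 1 (both objects in a single match)
echo count($ungreedy[0]), "\n";      // 2 (one match per object)
var_dump($ungreedy[0] === $lazy[0]); // bool(true)
```

As written, this pattern matches every object, not only Page objects, so the results would still need to be filtered for /Type /Page.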

Saral