I am trying to capture an ID for a PDF Page object that looks like this :
4 0 obj
<<
/Type /Page /
...
>>
endobj
The ID is this 'ID 0 obj'. The problem is that my file has multiple objects and so the following pattern captures from the first object declaration to the first instance of a Page object :
preg_match_all("/([0-9]+) 0 obj.+?\/Page[ \n]*?\//s", $input_lines, output_array);
Here is a sample of my file if you want to try it out, you will see that are multiple objects that include the word 'Page' :
%PDF-1.3
%¦¦¦¦
1 0 obj
<<
/Type /Catalog /AcroForm << /Fields [12 0 R 13 0 R] /NeedAppearances false /SigFlags 3 /Version /1.7 /Pages 3 0 R /Names << >> /ViewerPreferences << /Direction /L2R >> /PageLayout /SinglePage /PageMode /UseNone /OpenAction [0 0 R /FitH null] /DR << /Font << /F1 14 0 R >> >> /DA (/F1 0 Tf 0 g) /Q 0 >> /Perms << /DocMDP 11 0 R >>
/Outlines 2 0 R
/Pages 3 0 R
>>
endobj
2 0 obj
<<
/Type /Outlines
/Count 0
>>
endobj
3 0 obj
<<
/Type /Pages
/Count 2
/Kids [ 4 0 R 6 0 R ]
>>
endobj
4 0 obj
<<
/Type /Page
/Parent 3 0 R
/Resources <<
/Font <<
/F1 9 0 R
>>
/ProcSet 8 0 R
>>
/MediaBox [0 0 612.0000 792.0000]
/Contents 5 0 R
>>
endobj
5 0 obj
<< /Length 1074 >>
stream
2 J
BT
0 0 0 rg
/F1 0027 Tf
57.3750 722.2800 Td
( A Simple PDF File ) Tj
ET
BT
/F1 0010 Tf
What should I change to not make it greedy ?
EDIT : Clarifications
- I forgot to mention that I need to capture all of the Page object IDs.
- As some people told me to use more specific regex, I have to say that this is not a formal example of how objects are build and this one is also possible. You can see that the spaces are not mendatory and that there can be multiple tags before the Page '/Type /Page' tag.
Example :
4 0 obj
<< /UselessTag/Type/Page/
...
>>
endobj
- There are tags called Pages, PageLayout, SiglePage and I don't want to capture them.