Regex to capture page number from filename

Question

I have document page images named (for example) as follows:

“2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 01 [Declaration 1].png”
“2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 02 [Declaration 2].png”
“2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 07 [Fire].png”
“2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 12 [Fungi etc].png”

I want to capture ONLY the page numbers, without preceding zeros (1, 2, 7, 12 in this example). Based on code I saw here, I thought maybe something like this might take care of it:

 - 0*\d+.*\.(?:jpe?g|png|tiff?)$(?!(?:0*)\d+)

…but, it did not. Any other suggestions?

Thanks @JvdV, looks like you might have posted your comment before my clarification, but this doesn’t work when there’s no other text following the page number. — mazeckenrode, Jul 25 '20 at 19:38

The fourth bird · Accepted Answer · 2020-07-25T21:29:38.233

3

You could use a capturing group for the digits:

 - 0*(\d+) \[[^][]*]\.(?:jpe?g|png|tiff?)\b

Explanation

` - 0*` Match a space, a hyphen, a space, then zero or more zeros
`(\d+)` Capture group 1, match one or more digits
` \[[^][]*]` Match a space, then everything from [ to the closing ]
`\.(?:jpe?g|png|tiff?)\b` Match a dot, one of the listed extensions, and a word boundary
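
To try the pattern outside a regex tester, here is a minimal sketch in Python. The question does not say which tool or language is in use, so Python is an assumption, and the names `LABELLED_PAGE` and `page_number` are made up for this illustration. The filenames are the four from the question.

```python
import re

# Pattern copied from the answer: " - ", optional leading zeros, the page
# digits (group 1), a bracketed label, then an image extension.
LABELLED_PAGE = re.compile(r" - 0*(\d+) \[[^][]*]\.(?:jpe?g|png|tiff?)\b")

# The four example filenames from the question.
FILENAMES = [
    "2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 01 [Declaration 1].png",
    "2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 02 [Declaration 2].png",
    "2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 07 [Fire].png",
    "2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 12 [Fungi etc].png",
]


def page_number(filename):
    """Return the page number as an int, or None if the pattern does not match."""
    match = LABELLED_PAGE.search(filename)
    return int(match.group(1)) if match else None


PAGES = [page_number(name) for name in FILENAMES]
print(PAGES)  # [1, 2, 7, 12]
```

Because `0*` sits outside the capture group, group 1 never contains the leading zeros, so no post-processing is needed beyond the optional `int()` conversion.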

To capture the last digits without leading zeroes after the last occurrence of space dash space, you could use a negative lookahead:

 - 0*(\d+)(?!.* - ).*\.(?:jpe?g|png|tiff?)$
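
A quick way to check this variant against filenames with and without trailing text is a short Python sketch (Python is an assumption here, since the thread does not name a language). The first filename is from the question. The other two are invented for this example to cover the "no text" and "plain text" cases described in the comments, and `LAST_PAGE` and `last_page_number` are hypothetical names.

```python
import re

# Pattern copied from the answer: the lookahead (?!.* - ) rejects any
# " - " that is followed by another " - " later in the name, so only the
# digits after the last " - " are captured.
LAST_PAGE = re.compile(r" - 0*(\d+)(?!.* - ).*\.(?:jpe?g|png|tiff?)$")

SAMPLES = [
    # From the question: bracketed label after the page number.
    "2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 01 [Declaration 1].png",
    # Invented: nothing after the page number.
    "2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 03.png",
    # Invented: plain text after the page number, different extension.
    "2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 10 notes.jpg",
]


def last_page_number(filename):
    """Return the page number after the last ' - ', or None if there is no match."""
    match = LAST_PAGE.search(filename)
    return int(match.group(1)) if match else None


RESULTS = [last_page_number(name) for name in SAMPLES]
print(RESULTS)  # [1, 3, 10]
```

Note that the earlier ` - 12345-67890` segment is skipped: every attempt to match there fails the lookahead because another ` - ` still follows.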

edited Jul 25 '20 at 21:29

answered Jul 25 '20 at 16:05


Thanks. This works for the specific filename examples I provided, but I should have clarified that the various filenames I deal with may or may not have additional text after the page numbers (` [text]` or ` text` or whatever), though that text should not include any additional occurrence of ` - ` (space dash space). So basically, I’m trying to capture integers which immediately follow the last ` - ` , but which may or may not be followed by a space and additional text before the `.(?:jpe?g|png|tiff?)`. (Don’t know why some of my backticks for code aren’t doing their job.) – mazeckenrode Jul 25 '20 at 19:19
Try it like this `0*(\d+)(?!.* - ).*\.(?:jpe?g|png|tiff?)\b` See https://regex101.com/r/Jlceba/1 – The fourth bird Jul 25 '20 at 20:29

That’s it, ` - 0*(\d+)(?!.* - ).*\.(?:jpe?g|png|tiff?)$` works for all my test cases! Thanks! And sorry for the inadequacy of my initial examples. – mazeckenrode Jul 26 '20 at 00:30

JvdV · Answer 2 · 2020-07-25T20:44:05.407

2

So it looks like you want to end up at the last hyphen. Try:

-\h*(?!.*-)0*(\d+)

-\h* - Match a literal hyphen and zero or more horizontal whitespace characters.
(?!.*-) - A negative lookahead asserting that no further hyphen occurs later in the string.
0* - Zero or more zeroes.
(\d+) - Capture at least a single digit into capture group 1.
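
For anyone testing this in Python (an assumption, since the thread does not name a language): Python's `re` module does not support `\h`, so the sketch below swaps in `[ \t]` as a stand-in. That substitution is mine, not part of the answer, and `AFTER_LAST_HYPHEN` and `page_after_last_hyphen` are hypothetical names. The last sample filename is invented to show what happens when the trailing text itself contains a hyphen.

```python
import re

# Adapted from the answer's pattern: "\h" (horizontal whitespace in PCRE)
# is replaced by "[ \t]" because Python's re module has no "\h" escape.
AFTER_LAST_HYPHEN = re.compile(r"-[ \t]*(?!.*-)0*(\d+)")


def page_after_last_hyphen(filename):
    """Return the digits that follow the last hyphen, or None if there are none."""
    match = AFTER_LAST_HYPHEN.search(filename)
    return int(match.group(1)) if match else None


# From the question.
FROM_QUESTION = "2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 12 [Fungi etc].png"
# Invented: the label contains a hyphen, so the last hyphen is no longer
# the one in front of the page number.
HYPHEN_IN_LABEL = "2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 07 [Fire-safety].png"

print(page_after_last_hyphen(FROM_QUESTION))    # 12
print(page_after_last_hyphen(HYPHEN_IN_LABEL))  # None
```

The pattern keys on the last hyphen of any kind rather than the last ` - `, so it relies on the text after the page number being hyphen-free.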

End note: Please give credit where credit is due. Your question did not have the necessary details given later through comments. This answer is far more detailed based on what you provided in the OP.

edited Jul 25 '20 at 20:44

answered Jul 25 '20 at 19:54


This also works for my test cases, though The fourth bird’s final answer is more specific to what I’m looking for, but thanks for the suggestions, and my apologies for having left other relevant examples out of my original post. – mazeckenrode Jul 26 '20 at 00:32
