
I've been trying for a couple of days to create a regex that finds the correct picture for a product barcode in the pictures folder. The folder contains about 4500 pictures. The file name can be in one of 4 formats.

  1. XXXXXX.jpg/png - the short barcode, an unknown number of characters (digits only).
  2. 00000 (one or more leading zeros, count unknown) XXXX (then the short barcode).jpg/png
  3. 729 (as a leading number) 00000 (one or more leading zeros, count unknown) XXXX (then the short barcode).jpg/png
  4. 72900000XXXXXXYYY YYY YYY.jpg/png - same as option 3 but followed by some characters (Y represents a character).

I came up with something like this:

$i = new RegexIterator($a, '($barcode)\D*|^([0][0-9]+$barcode)\D+|(729[0-9][0-9]+$barcode)\D+|(729[0-9][0-9]+$barcode).+/', RegexIterator::GET_MATCH);
$barcode - can be 7290000232 or 0000232 or 232

But it doesn't work. Any ideas?

Nitay
  • Examples of what each format should and shouldn't match would be helpful as test cases – MDEV Feb 19 '14 at 09:28

2 Answers


You have four cases that build up on each other:

  1. Only numbers, 1 to unlimited times: \d+
  2. 1. with leading zeros: effectively the same as 1., as zeros are numbers ;) No need for a special case here
  3. 1. optionally preceded by 729: (?:729)?\d+ (this may already be used for the cases 1.-3.)
  4. 3. with optional characters (zero to unlimited): (?:729)?\d+(?:[a-zA-Z])*

Only the extension is left to be added:

((?:729)?\d+(?:[a-zA-Z])*\.(?:jpg|png))

Now there's one thing left. This regex would match on abc123.jpg, as 123.jpg is perfectly valid. To counter this we add ^ (this denotes the start of the input):

^((?:729)?\d+(?:[a-zA-Z])*\.(?:jpg|png))

demo @ regex101
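
To try the anchored pattern against a few file names in PHP, a minimal sketch could look like the following. The sample names are made up for illustration, and `#` is chosen as the delimiter only to keep the pattern readable.

```php
<?php
// Anchored pattern from above: optional 729 prefix, digits,
// optional letters, then the extension.
$pattern = '#^((?:729)?\d+(?:[a-zA-Z])*\.(?:jpg|png))#';

// Hypothetical file names, only for illustration.
$names = ['123.jpg', '000123.png', '7290000123abc.jpg', 'abc123.jpg'];

$results = [];
foreach ($names as $name) {
    // preg_match returns 1 on a match and 0 otherwise.
    $results[$name] = preg_match($pattern, $name) === 1;
}

foreach ($results as $name => $ok) {
    echo $name, ' => ', $ok ? 'match' : 'no match', PHP_EOL;
}
```

Only abc123.jpg should be rejected here, because the ^ anchor requires the name to start with the digits.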

As you insert the barcode (from case 1) yourself, there are a few adjustments to be made:

^((?:729)?0*?$barcode(?:[a-zA-Z])*\.(?:jpg|png))

Here the second case is covered by 0*? (the digit 0, zero to unlimited times, lazy).
Regarding the [a-zA-Z]: you have to decide what to allow here. Currently it only allows lowercase and uppercase letters. If you want to allow spaces (for example), then simply add them to the character group: [a-zA-Z ].
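
As for where $barcode goes: PHP does not interpolate variables inside single-quoted strings, so the pattern has to be built with double quotes or concatenation. Below is a rough sketch of how this might be wired into a RegexIterator. The helper name barcodePattern, the sample file list, and the use of ArrayIterator as a stand-in for the real folder are all assumptions for illustration, and $barcode is assumed to be the short form (e.g. 232).

```php
<?php
// Hypothetical helper: builds the pattern for one short barcode (e.g. "232").
function barcodePattern(string $barcode): string
{
    // preg_quote guards against regex metacharacters in the input;
    // the second argument also escapes the chosen delimiter '#'.
    $quoted = preg_quote($barcode, '#');
    // The character class includes a space, as suggested above.
    return '#^((?:729)?0*?' . $quoted . '(?:[a-zA-Z ])*\.(?:jpg|png))#';
}

// Stand-in for the directory listing; real code might wrap a
// FilesystemIterator or the result of scandir() instead.
$files = new ArrayIterator([
    '232.jpg',
    '0000232.png',
    '7290000232.jpg',
    '7290000232 abc def.jpg',
    '1232.jpg',
    '5555.png',
]);

$matches = iterator_to_array(
    new RegexIterator($files, barcodePattern('232'), RegexIterator::MATCH),
    false // drop the original keys so the result is a plain list
);

print_r($matches);
```

With this list, 1232.jpg and 5555.png are filtered out: the only things allowed in front of the barcode are the optional 729 and zeros.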

For non-latin characters you can use [\x{00BF}-\x{1FFF}\x{2C00}-\x{D7FF}a-zA-Z] (credits to this comment) as your character group, so your regex would then look like:

^((?:729)?0*?123(?:[\x{00BF}-\x{1FFF}\x{2C00}-\x{D7FF}a-zA-Z])*\.(?:jpg|png))

demo @ regex101
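
One thing to watch when moving this pattern into PHP: as far as I know, \x{...} with code points above 0xFF is only accepted when the pattern carries the u (UTF-8) modifier. A small sketch, with made-up file names and assuming the script itself is saved as UTF-8:

```php
<?php
// The 'u' modifier switches PCRE to UTF-8 mode; without it, code points
// above 0xFF inside \x{...} are rejected when the pattern is compiled.
$pattern = '#^((?:729)?0*?123(?:[\x{00BF}-\x{1FFF}\x{2C00}-\x{D7FF}a-zA-Z])*\.(?:jpg|png))#u';

// Hypothetical file names, only for illustration.
$latin    = preg_match($pattern, '7290000123abc.jpg') === 1;
$cyrillic = preg_match($pattern, '7290000123фото.jpg') === 1;
$digits   = preg_match($pattern, '7290000123456.jpg') === 1;

var_dump($latin, $cyrillic, $digits);
```

The last name is rejected because digits after the barcode are not part of the character group, so a longer barcode is not mistaken for 123.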

Community
KeyNone
  • Thanks for the fast answer. Can you tell me where I should put the variable $barcode? Also, I need [a-zA-Z] to cover either English or other languages. – Nitay Feb 19 '14 at 09:26
  • @Nitay you can't "allow languages" as you have no possibility of checking that a word is in a certain language. You can allow characters. Which characters exactly, that's up to you. – KeyNone Feb 19 '14 at 09:40
  • Thanks again. I want to allow after the barcode any letter that is not a number. – Nitay Feb 19 '14 at 09:43
  • @Nitay And by `any letter` you mean non-latin-letters, too? – KeyNone Feb 19 '14 at 09:44
  • this is exactly what i meant :) – Nitay Feb 19 '14 at 09:50
  • Look at this example: http://regex101.com/r/jQ4oX2. When I use a UTF character it doesn't work. – Nitay Feb 19 '14 at 10:07

From what I understand, options 1-3 are all the same (729 is a digit string like any other):

^\d+[.](?:jpg|png)$

With 4 you are saying 'allow word characters and whitespaces, but only if name starts with 729'. So it is now:

(?:(?:^\d+[.](?:jpg|png)$)|(?:^729\d*[\w\s]+[.](?:jpg|png)$))

\s matches whitespace, \w matches word characters.
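
A quick way to see how the two alternatives behave is to run the combined pattern over a few names in PHP. This is only a sketch with invented file names; note that \w also covers digits and the underscore, which the last sample shows.

```php
<?php
// Two alternatives: digits only, or a 729 prefix followed by
// word characters and/or whitespace.
$pattern = '#(?:(?:^\d+[.](?:jpg|png)$)|(?:^729\d*[\w\s]+[.](?:jpg|png)$))#';

// Hypothetical file names, only for illustration.
$names = ['123.jpg', '72900123 abc.png', '123abc.jpg', '729_1.jpg'];

$checks = [];
foreach ($names as $name) {
    $checks[$name] = preg_match($pattern, $name) === 1;
}

print_r($checks);
```

Here 123abc.jpg is rejected because trailing characters are only allowed after a 729 prefix, while 729_1.jpg is accepted.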

greg-449
acarlon
  • @Nitay What part of it, exactly? That the letters aren't allowed to be English or any other language? If so, could you explain a bit more? – SQB Feb 19 '14 at 09:39
  • Keep in mind that `\w` matches underscores (and numbers), too and `\s` matches tabs and newlines, too. May lead to some unwanted behaviour. – KeyNone Feb 19 '14 at 09:45
  • @BastiM - agreed, though he is searching files in a folder and filenames can't contain newlines or tabs. I was not clear as to whether `character` includes numbers or only letters, which is why I used \w. – acarlon Feb 19 '14 at 09:51
  • @BastiM +1 to you though - I wasn't sure by what he meant by the $barcode. – acarlon Feb 19 '14 at 09:53