
I'm trying to parse strings with a regex in PHP. They always have the format FooBar(,[0-9]{7}[0-9A-F]{8})+. In other words, they have a start value/word followed by one or more entries; each entry is one comma (,) followed by 7 digits and 8 hexadecimal characters (digits or uppercase letters A to F).

My regex to capture this is /^C7(,[0-9]{7}[0-9A-F]{8})+$/, which kind of works. When used with preg_match_all, it returns an array with two entries. The first is the input string, as expected; the second, however, contains only one entry: the last matched chunk (see Example).

I need to capture all the chunks matched by the capturing group. I did some research and found this answer, which seemed to be about the same issue: https://stackoverflow.com/a/2205009/2989952. So I adjusted my regex to /(,[0-9]{7}[0-9A-F]{8})+$/, but I still only get one match. This can be tested at regex101.com.

I then experimented some more and found that if I change the input string to contain a space (or any other unmatched character) between the chunks, like this: C7,22801422CFE0F63 ,2280141C5EF0F63 ,22801402EFD0F63 ,2280138C5ED0F63 ,228024329897530 ,228023829877530 and adjust the regex once again to /(,[0-9]{7}[0-9A-F]{8})+/, it does exactly what it is intended to do!

Question: Is there a way to achieve this, matching all the chunks in this recurring group without adding whitespaces in between? If so, how?

EDIT

To illustrate the problem:
No whitespace: https://regex101.com/r/ilkZjD/1

Whitespace/random chars: https://regex101.com/r/mimBgz/1

Goal: the behaviour of the second one (the one with whitespace), but without adding the whitespace (or any other unmatched characters).

EDIT 2 (hacky solution)

I kind of found a solution based on this answer: https://stackoverflow.com/a/3513858/2989952. The regex /(?:,)([0-9]{7}[0-9A-F]{8})/ works for me: https://regex101.com/r/LEEFzv/1. However, I'd still like a way to match the initial FooBar, as that indicates whether the incoming string should be matched with this regex at all.
(I know I could simply check the string with a second regex for this, but I would love to have it in one regex.)
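
For comparison, the two-regex fallback mentioned above could look roughly like this. It is only a sketch: `extractChunks` is a hypothetical helper name, and the literal prefix C7 is assumed from the regex earlier in the question.

```php
<?php
// Hypothetical helper: returns the chunks, or null if the string is malformed.
function extractChunks(string $input): ?array
{
    // Step 1: validate the whole string, including the required prefix.
    // The D modifier stops $ from matching before a trailing newline.
    if (preg_match('/^C7(?:,[0-9]{7}[0-9A-F]{8})+$/D', $input) !== 1) {
        return null;
    }

    // Step 2: the group is no longer repeated inside one match, so every
    // chunk becomes its own match and preg_match_all keeps all of them.
    preg_match_all('/,([0-9]{7}[0-9A-F]{8})/', $input, $m);

    return $m[1];
}
```

The validation step is what the single-regex attempts are missing: the extraction pattern alone would happily pull chunks out of a string that does not start with the expected prefix.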

Example:
Input: 'C7,22801422CFE0F63,2280141C5EF0F63,22801402EFD0F63,2280138C5ED0F63,228024329897530,228023829877530'
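
To make the symptom concrete, here is a minimal sketch that runs the anchored pattern from above against this input. The variable names are only illustrative; the point is that a repeated capturing group keeps just its last iteration.

```php
<?php
$input = 'C7,22801422CFE0F63,2280141C5EF0F63,22801402EFD0F63,2280138C5ED0F63,228024329897530,228023829877530';

// The whole string is a single match. The repeated group (...)+ is
// overwritten on every iteration, so only the final chunk survives.
$count = preg_match_all('/^C7(,[0-9]{7}[0-9A-F]{8})+$/', $input, $m);

var_dump($count);  // int(1)
print_r($m[1]);    // one entry: ,228023829877530
```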

Cœur
wawa
  • `^` indicates beginning of line. Since you have only one line you had only one capture. And `$` indicates end of line. Maybe in your regex-engine it matches space as well. – Aedvald Tseh Jan 06 '18 at 15:51
  • All your data is one line and therefore there is no need to use `$` or `^`. Just remove it and it should work. – Aedvald Tseh Jan 06 '18 at 15:52
  • I'm aware of that and that's why I removed it in later attempts, as stated in the third paragraph. The regex engine is PHP7.1, if that makes a difference. However, the issue here is that the capturing group gets overwritten instead of added. – wawa Jan 06 '18 at 15:53
  • To try it out, use "(,[0-9]{7}[0-9A-F]{8})+" as regex on https://regex101.com/?flags=[g]&flavor=php with the given input. And then add a whitespace in front of any comma. – wawa Jan 06 '18 at 15:55
  • `/(?<=,)([0-9]{7}[0-9A-F]{8})+/` – splash58 Jan 06 '18 at 16:16
  • Include links to regex101 instead of Pictures. – SamWhan Jan 06 '18 at 16:24
  • @splash58 is it better to use a positive lookbehind, than to use a group and just exclude it from the result as I did in Edit 2 ("/(?:,)([0-9]{7}[0-9A-F]{8})/")? If so, what's the difference? Also is there a way to validate if the initial "FooBar" is there at the beginning of the string with your way? – wawa Jan 06 '18 at 16:24
  • https://regex101.com/r/ZvYlQ0/1 – splash58 Jan 06 '18 at 16:26
  • You can't test by the same expression the structure of the full string if you want to separate groups :( – splash58 Jan 06 '18 at 16:28
  • Maybe [`(?:^[^,]+|(?:[0-9]{7}[0-9A-F]{8})+)`](https://regex101.com/r/0kZlPo/1) – The fourth bird Jan 06 '18 at 16:34
  • @Thefourthbird almost, the issue with this one is that it will match "FooBar", but also "Baz" at the beginning. I need one that only matches "FooBar" – wawa Jan 06 '18 at 16:41
  • Like [`(?:FooBar|(?:[0-9]{7}[0-9A-F]{8})+)`](https://regex101.com/r/OgSjzP/1/)? – The fourth bird Jan 06 '18 at 16:48
  • Even closer! It should however only match if there's "FooBar" in front, not optionally have it (and match it if it's there) – wawa Jan 06 '18 at 16:53

4 Answers


Is that what you want?

$in = 'C7,22801422CFE0F63 ,2280141C5EF0F63 ,22801402EFD0F63 ,2280138C5ED0F63 ,228024329897530 ,228023829877530';

preg_match_all('/(^\w+|\G)\h*(,[0-9]{7}[0-9A-F]{8})/', $in, $m);
print_r($m);

Output:

Array
(
    [0] => Array
        (
            [0] => C7,22801422CFE0F63
            [1] =>  ,2280141C5EF0F63
            [2] =>  ,22801402EFD0F63
            [3] =>  ,2280138C5ED0F63
            [4] =>  ,228024329897530
            [5] =>  ,228023829877530
        )

    [1] => Array
        (
            [0] => C7
            [1] => 
            [2] => 
            [3] => 
            [4] => 
            [5] => 
        )

    [2] => Array
        (
            [0] => ,22801422CFE0F63
            [1] => ,2280141C5EF0F63
            [2] => ,22801402EFD0F63
            [3] => ,2280138C5ED0F63
            [4] => ,228024329897530
            [5] => ,228023829877530
        )

)

Explanation:

(               : start group 1
  ^\w+          : beginning of line, 1 or more word characters
  |             : OR
  \G            : continue from the end of the previous match
)               : end group 1
\h*             : 0 or more horizontal spaces
(               : start group 2
  ,             : a comma
  [0-9]{7}      : 7 digits
  [0-9A-F]{8}   : 8 hexadecimal characters
)               : end group 2
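
As a rough sketch of how the same \G idea behaves on the input without spaces: the pattern below assumes the literal prefix C7 from the question, drops the \h* since no whitespace is expected, and adds (?!^) after \G so that a string with no prefix cannot start the chain at offset 0. These adjustments are assumptions for illustration, not part of the pattern explained above.

```php
<?php
$input = 'C7,22801422CFE0F63,2280141C5EF0F63,22801402EFD0F63,2280138C5ED0F63,228024329897530,228023829877530';

// ^C7 starts the chain; \G(?!^) continues only where the previous match
// ended, and never at the very start of the subject.
$pattern = '/(?:^C7|\G(?!^))(,[0-9]{7}[0-9A-F]{8})/';

preg_match_all($pattern, $input, $m);
print_r($m[1]);  // six chunks, each with its leading comma

// Without the expected prefix the chain never starts, so nothing matches.
preg_match_all($pattern, 'XX,22801422CFE0F63', $none);
var_dump(count($none[1]));  // int(0)
```

Note that this only checks the prefix. Trailing garbage after a valid chunk would simply end the chain, so the chunks before it are still returned.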
Toto
  • Almost perfect, I managed to achieve exactly what I want with some small changes. "(?:^C7|\G)\h*(,[0-9]{7}[0-9A-F]{8})" https://regex101.com/r/WRlHrv/1 – wawa Jan 06 '18 at 16:56

To capture all chunks including the first part, you could try:

(?:FooBar|(?:[0-9]{7}[0-9A-F]{8})+)

Explanation

  • A non capturing group (?:
  • Match FooBar
  • Or |
  • Your format in a non-capturing group, repeated one or more times: (?:[0-9]{7}[0-9A-F]{8})+
  • Close non capturing group


The fourth bird

Hmm, maybe I don't understand the problem, but your regex will work for the first scenario if you remove the trailing +:

(,[0-9]{7}[0-9A-F]{8})
user6307854

You can build a pattern to get contiguous matches using the A flag (which means Anchored). The main benefit is that you can extract your values and check the format of the line at the same time using a lookahead:

$pattern = '~
    (?!^)  # fails at the start of the string
    ( \h*,\h* (?<value>[0-9]{7}[A-F0-9]{8}) )
    # the first capture group is useful to shorten
    # the lookahead in the second branch.
  |
    (?<first>[a-zA-Z0-9]+)(?=(?1)*$)
~xA';

if ( preg_match_all($pattern, $yourstring, $matches) ) {
    echo $matches['first'][0], PHP_EOL;
    print_r(array_values(array_filter($matches['value'])));
} 

demo

The A flag forces each match to start at the beginning of the string or at the end of the previous match.

The first branch describes a comma separated value and the second branch the start of the line.

The lookahead (?=(?1)*$) checks the structure of the rest of the line. If it fails, no match is possible.
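
To see the structure check in action, the pattern can be wrapped in a small function and fed a valid and an invalid line. This is only a sketch: `parseLine` and the shape of its return value are hypothetical names chosen for illustration.

```php
<?php
// Hypothetical wrapper: returns the first word and the values,
// or null when the line does not have the expected structure.
function parseLine(string $line): ?array
{
    $pattern = '~
        (?!^) ( \h*,\h* (?<value>[0-9]{7}[A-F0-9]{8}) )
      |
        (?<first>[a-zA-Z0-9]+) (?=(?1)*$)
    ~xA';

    // With the A flag the first match must begin at offset 0, so if the
    // lookahead rejects the line there is no match at all.
    if (!preg_match_all($pattern, $line, $matches)) {
        return null;
    }

    return [
        'first'  => $matches['first'][0],
        // Drop the empty slot produced by the match of the first word.
        'values' => array_values(array_filter($matches['value'])),
    ];
}

var_dump(parseLine('C7,22801422CFE0F63,oops'));  // NULL
print_r(parseLine('C7,22801422CFE0F63,2280141C5EF0F63'));
```

Unlike the variants that only test the prefix, a malformed chunk anywhere in the line makes the whole call return null, because the lookahead has to reach the end of the string.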

Casimir et Hippolyte