php: better way to split string into associative array

Question

I have a string like this:

"ALARM_ID/I4=1010001 ALARM_STATE/U4=eventcode ALARM_TEXT/A=WMR_MAP_EXPORT LOTS/A[1]=[ STEFANO ] ALARM_STATE/U1=1 WAFER/U4=1 VI_KLARF_MAP/A=/test/klarf.map KLARF_STEPID/A=StepID KLARF_DEVICEID/A=DeviceID KLARF_EQUIPMENTID/A=EquipmentID KLARF_SETUP_ID/A=SetupID RULE_ID/U4=1234 RULE_FORMULA_EXPRESSION/A=a < b && c > d RULE_FORMULA_TEXT/A=1 < 0 && 2 > 3 RULE_FORMULA_RESULT/A=FAIL TIMESTAMP/A=10-Nov-2020 09:10:11 99999999"

and my goal is to split into an associative array:

Array
(
    [ALARM_ID/I4] => 1010001
    [ALARM_STATE/U4] => eventcode
    [ALARM_TEXT/A] => WMR_MAP_EXPORT
    [LOTS/A[1]] => [ STEFANO ]
    [ALARM_STATE/U1] => 1
    [WAFER/U4] => 1
    [VI_KLARF_MAP/A] => /test/klarf.map
    [KLARF_STEPID/A] => StepID
    [KLARF_DEVICEID/A] => DeviceID
    [KLARF_EQUIPMENTID/A] => EquipmentID
    [KLARF_SETUP_ID/A] => SetupID
    [RULE_ID/U4] => 1234
    [RULE_FORMULA_EXPRESSION/A] => a < b && c > d
    [RULE_FORMULA_TEXT/A] => 1 < 0 && 2 > 3
    [RULE_FORMULA_RESULT/A] => FAIL
    [TIMESTAMP/A] => 10-Nov-2020 09:10:11 99999999
)

The only (but maybe dirty) way I have found is this script:

<?php
$msg = "ALARM_ID/I4=1010001 ALARM_STATE/U4=eventcode ALARM_TEXT/A=WMR_MAP_EXPORT LOTS/A[1]=[ STEFANO ] ALARM_STATE/U1=1 WAFER/U4=1 VI_KLARF_MAP/A=/test/klarf.map KLARF_STEPID/A=StepID KLARF_DEVICEID/A=DeviceID KLARF_EQUIPMENTID/A=EquipmentID KLARF_SETUP_ID/A=SetupID RULE_ID/U4=1234 RULE_FORMULA_EXPRESSION/A=a < b && c > d RULE_FORMULA_TEXT/A=1 < 0 && 2 > 3 RULE_FORMULA_RESULT/A=FAIL TIMESTAMP/A=10-Nov-2020 09:10:11 99999999";
$split = explode("=", $msg);
foreach($split as $k => $s) {
    $s = explode(" ", $s);
    $keys[] = array_pop($s);
    if ($s) $values[] = implode(" ", $s);
}
/*
 * this is needed in case the value of the last parameter (TIMESTAMP) contains no spaces
 */
if (count($values) + 2 == count($keys)) array_push($values, array_pop($keys));
else                                    $values[ count($values) - 1 ] .= " " . array_pop($keys);
$params = array_combine($keys, $values);
print_r($params);
?>

Do you see a better way to split it maybe using regular expression or a different (elegant?) approach?

Can you change the string you're getting? A better practice would be to receive the string in a format like JSON or XML, which would make accidental parsing mistakes much less likely. Or can you not influence how you receive the string? — Definitely not Rafal, Nov 11 '20 at 08:23
@DefinitelynotRafal unfortunately I cannot. The string is received from an automation host in VFEI (Virtual Factory Equipment Interface) format (that's an unchangeable standard). — Stefano Radaelli, Nov 11 '20 at 08:31

mickmackusa · Answer 1 · 2020-11-11T12:37:02.497

The important thing to do in maintaining accuracy is to ensure that "keys" are properly matched.

Key strings will never contain a space or an equals sign. Value strings may contain either. Value strings will run to the end of the string or be followed by a space then the next key (which may not have any spaces or equal signs).

The key string can be "greedily" matched before the occurrence of the first encountered =.

The value string must not be greedily matched. This ensures that the value is not over-extended into the next key-value pair.

The lookahead after the value string ensures that the potential following key is not damaged/consumed.

Pattern Breakdown:

([^ =]+)     #capture one or more non-space, non-equals characters (greedily) and store as capture group #1
=            #match but do not capture an equals sign
(.+?)        #capture one or more of any non-newline character (giving back when possible / non-greedy) and store as capture group #2
(?=          #start lookahead
  $          #match the end of the string
  |          #OR operator
   [^ =]+=   #match space, then one or more non-space and non-equals characters, then match equals sign
)            #end lookahead

Code: (Demo)

$msg = "ALARM_ID/I4=1010001 ALARM_STATE/U4=eventcode ALARM_TEXT/A=WMR_MAP_EXPORT LOTS/A[1]=[ STEFANO ] ALARM_STATE/U1=1 WAFER/U4=1 VI_KLARF_MAP/A=/test/klarf.map KLARF_STEPID/A=StepID KLARF_DEVICEID/A=DeviceID KLARF_EQUIPMENTID/A=EquipmentID KLARF_SETUP_ID/A=SetupID RULE_ID/U4=1234 RULE_FORMULA_EXPRESSION/A=a < b && c > d RULE_FORMULA_TEXT/A=1 < 0 && 2 > 3 RULE_FORMULA_RESULT/A=FAIL TIMESTAMP/A=10-Nov-2020 09:10:11 99999999";

preg_match_all('~([^ =]+)=(.+?)(?=$| [^ =]+=)~', $msg, $out);
var_export(array_combine($out[1], $out[2]));

Output:

array (
  'ALARM_ID/I4' => '1010001',
  'ALARM_STATE/U4' => 'eventcode',
  'ALARM_TEXT/A' => 'WMR_MAP_EXPORT',
  'LOTS/A[1]' => '[ STEFANO ]',
  'ALARM_STATE/U1' => '1',
  'WAFER/U4' => '1',
  'VI_KLARF_MAP/A' => '/test/klarf.map',
  'KLARF_STEPID/A' => 'StepID',
  'KLARF_DEVICEID/A' => 'DeviceID',
  'KLARF_EQUIPMENTID/A' => 'EquipmentID',
  'KLARF_SETUP_ID/A' => 'SetupID',
  'RULE_ID/U4' => '1234',
  'RULE_FORMULA_EXPRESSION/A' => 'a < b && c > d',
  'RULE_FORMULA_TEXT/A' => '1 < 0 && 2 > 3',
  'RULE_FORMULA_RESULT/A' => 'FAIL',
  'TIMESTAMP/A' => '10-Nov-2020 09:10:11 99999999',
)
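To make the point about non-greedy matching concrete, here is a minimal sketch that runs the same pattern twice on a shortened sample, once with a lazy value group and once with a greedy one. The sample string and the variable names are illustrative assumptions. The key class here excludes spaces as well as equals signs, so keys carry no leading space.

```php
<?php
// Shortened sample in the same key=value format as the question (illustrative input).
$sample = 'RULE_ID/U4=1234 RULE_FORMULA_TEXT/A=1 < 0 && 2 > 3 RULE_FORMULA_RESULT/A=FAIL';

// Lazy value group (.+?): stops at the first position where the lookahead succeeds.
preg_match_all('~([^ =]+)=(.+?)(?=$| [^ =]+=)~', $sample, $lazy);

// Greedy value group (.+): runs to the end of the string and swallows the later pairs.
preg_match_all('~([^ =]+)=(.+)(?=$| [^ =]+=)~', $sample, $greedy);

$lazyPairs = array_combine($lazy[1], $lazy[2]);
$greedyPairs = array_combine($greedy[1], $greedy[2]);

var_export($lazyPairs);   // three separate key/value pairs
echo PHP_EOL;
var_export($greedyPairs); // a single pair whose value contains the rest of the string
```

With the greedy group, the first value extends to the end of the input because the `$` alternative of the lookahead is satisfied there, which is exactly the over-extension the lazy quantifier avoids.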

Can anyone explain why fourthbird's answer is gathering more votes than my correct, accurate, concise answer? At one point, they both had 1 UV, but for some unknown reason, his/her answer is pulling ahead and biasing researchers away from my answer. If there is something beyond a popularity contest, I'd like to know what is going on. — mickmackusa, Nov 11 '20 at 11:52
I know I am not everyone's cup of tea, but the voting should be on the answer, not the answerer. If the UVs are from writing out the regex breakdown, then I am happy to edit my answer. — mickmackusa, Nov 11 '20 at 11:59
@Stefano I would like to understand the metric by which you found TheFourthBird's answer to be superior. I used regex101 to compare the patterns and these are the results: His first pattern: `([^\s=/]+/[^\s=]+)=(.*?)(?=\h+[^\s=/]+/|$)`, **42-character pattern**, *16 matches*, **921 steps** ; His second pattern: `([^\W_]+(?:_[^\W_]+)*/[^\s=]*)=(.*?)(?=\h+[^\s=/]+/|$)`, **54-character pattern**, *16 matches*, **1007 steps** ; My pattern: `([^=]+)=(.+?)(?=$| [^ =]+=)` , **27-character pattern** , *16 matches*, **809 steps** So, mine is provably more efficient and more concise. — mickmackusa, Nov 20 '20 at 10:11

The fourth bird · Accepted Answer · 2020-11-11T09:22:18.987

You could leverage the presence of a / in all the keys:

([^\s=/]+/[^\s=]+)=(.*?)(?=\h+[^\s=/]+/|$)

Explanation

( Capture group 1
- [^\s=/]+ Match 1+ times any char except a whitespace char, = or /
- /[^\s=]+ Then match / followed by the rest of the key
) Close group 1
= Match literally
(.*?) Capture group 2, match any char except a newline, as few times as possible
(?=\h+[^\s=/]+/|$) Assert a key like format containing a / (as used in group 1)

See a Regex demo and a Php demo.

Example code

$re = '`([^\s=/]+/[^\s=]+)=(.*?)(?=\h+[^\s=/]+/|$)`';
$str = 'ALARM_ID/I4=1010001 ALARM_STATE/U4=eventcode ALARM_TEXT/A=WMR_MAP_EXPORT LOTS/A[1]=[ STEFANO ] ALARM_STATE/U1=1 WAFER/U4=1 VI_KLARF_MAP/A=/test/klarf.map KLARF_STEPID/A=StepID KLARF_DEVICEID/A=DeviceID KLARF_EQUIPMENTID/A=EquipmentID KLARF_SETUP_ID/A=SetupID RULE_ID/U4=1234 RULE_FORMULA_EXPRESSION/A=a < b && c > d RULE_FORMULA_TEXT/A=1 < 0 && 2 > 3 RULE_FORMULA_RESULT/A=FAIL TIMESTAMP/A=10-Nov-2020 09:10:11 99999999
';

preg_match_all($re, $str, $matches);
$result = array_combine($matches[1], $matches[2]);

print_r($result);

Output

Array
(
    [ALARM_ID/I4] => 1010001
    [ALARM_STATE/U4] => eventcode
    [ALARM_TEXT/A] => WMR_MAP_EXPORT
    [LOTS/A[1]] => [ STEFANO ]
    [ALARM_STATE/U1] => 1
    [WAFER/U4] => 1
    [VI_KLARF_MAP/A] => /test/klarf.map
    [KLARF_STEPID/A] => StepID
    [KLARF_DEVICEID/A] => DeviceID
    [KLARF_EQUIPMENTID/A] => EquipmentID
    [KLARF_SETUP_ID/A] => SetupID
    [RULE_ID/U4] => 1234
    [RULE_FORMULA_EXPRESSION/A] => a < b && c > d
    [RULE_FORMULA_TEXT/A] => 1 < 0 && 2 > 3
    [RULE_FORMULA_RESULT/A] => FAIL
    [TIMESTAMP/A] => 10-Nov-2020 09:10:11 99999999
)

If the keys should all start with word characters separated by an underscore, you can start the pattern using a repeating part [^\W_]+(?:_[^\W_]+)*

It will match word chars except an _, and then repeat matching _ followed by word chars except _ until it reaches a /

([^\W_]+(?:_[^\W_]+)*/[^\s=]*)=(.*?)(?=\h+[^\s=/]+/|$)

Regex demo
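As a rough sketch of how this stricter pattern could be used, the snippet below applies it in the same way as the example code above. The shortened `$str` is an illustrative assumption, not the full message from the question.

```php
<?php
// Shortened sample in the question's format (illustrative input, not the full message).
$str = 'ALARM_ID/I4=1010001 LOTS/A[1]=[ STEFANO ] KLARF_SETUP_ID/A=SetupID TIMESTAMP/A=10-Nov-2020 09:10:11 99999999';

// Key: alphanumeric words joined by underscores, then "/", then the type suffix.
$re = '`([^\W_]+(?:_[^\W_]+)*/[^\s=]*)=(.*?)(?=\h+[^\s=/]+/|$)`';

preg_match_all($re, $str, $matches);
$result = array_combine($matches[1], $matches[2]);

print_r($result);
```

Note that the lookahead only requires a following word that contains a `/`, so a value containing a space followed by a path-like word would be cut short at that point.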

Just out of curiosity: why use `\s` sometimes and `\h` some other time? I understand `\s` includes carriage returns as well (and maybe some vertical whitespaces), but since the original string doesn't appear to contain any, I'm wondering. — Jeto, Nov 11 '20 at 09:08
There is no mention of tabs or newlines in the sample input. The `\s`, the `\h`, and the `m` all seem needless to me. — mickmackusa, Nov 11 '20 at 09:11
@Jeto Fair question, I have used `\s` in the negated character class `[^\s=]+` to match any character except a whitespace char for the key as `\s` can also match a newline which I assume is not desired in the key. I use `\h` in the assertion to match horizontal whitespace chars to make sure the value is on the same line. I think for this example data you could use both `\s` or `\h` either way. — The fourth bird, Nov 11 '20 at 09:17
@mickmackusa the `m` should not be there, it came from copy-pasting the regex101 generated code. If you only want to match a space instead of `\s` or `\h` that is fine. I use it to match a broader range of whitespace chars. — The fourth bird, Nov 11 '20 at 09:20

KIKO Software · Answer 3 · 2020-11-11T11:25:35.643

I came up with this code, which uses only basic PHP functions. I think that a regular expression makes the code more difficult to read. Most of the time, even at the expense of having more verbose code, you are better off not using regular expressions. There might also be a performance impact.

$message = "ALARM_ID/I4=1010001 ALARM_STATE/U4=eventcode ALARM_TEXT/A=WMR_MAP_EXPORT LOTS/A[1]=[ STEFANO ] ALARM_STATE/U1=1 WAFER/U4=1 VI_KLARF_MAP/A=/test/klarf.map KLARF_STEPID/A=StepID KLARF_DEVICEID/A=DeviceID KLARF_EQUIPMENTID/A=EquipmentID KLARF_SETUP_ID/A=SetupID RULE_ID/U4=1234 RULE_FORMULA_EXPRESSION/A=a < b && c > d RULE_FORMULA_TEXT/A=1 < 0 && 2 > 3 RULE_FORMULA_RESULT/A=FAIL TIMESTAMP/A=10-Nov-2020 09:10:11 99999999";

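$parameters = [];
foreach (explode(' ', $message) as $word) {
    // a word with "=" after its first character starts a new key=value pair
    if (strpos($word, '=')) {
        // store the pair collected so far before starting the next one
        if (isset($key)) $parameters[$key] = $value;
        // limit 2 keeps any additional "=" inside the value
        list($key, $value) = explode('=', $word, 2);
    }
    else $value .= " $word";
}
// store the last pair
$parameters[$key] = $value;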

echo '<pre>';
print_r($parameters);
echo '</pre>';

I chose to split on the spaces, then I look for the = characters to find the words with the keys in them.

There are, of course, other ways of doing the same, but all will involve a bit of extra work because of the strange format of the message.

This routine currently does not tolerate errors in the message string, but it can easily be expanded to tolerate various types of input errors.
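One possible way to make the routine more forgiving is sketched below. The function name `parseMessage` and the choice of which errors to tolerate (an empty message, stray words before the first key, extra `=` signs inside a value word) are assumptions for illustration, not part of the original routine.

```php
<?php
/**
 * Hypothetical helper: word-by-word parser that skips malformed input
 * instead of raising notices.
 */
function parseMessage(string $message): array
{
    $parameters = [];
    $key = null;

    foreach (explode(' ', trim($message)) as $word) {
        $pos = strpos($word, '=');
        if ($pos !== false && $pos > 0) {
            // A word with "=" after at least one character starts a new pair.
            // Limit 2 keeps any further "=" inside the value.
            [$key, $value] = explode('=', $word, 2);
            $parameters[$key] = $value;
        } elseif ($key !== null) {
            // Any other word continues the value of the current key.
            $parameters[$key] .= " $word";
        }
        // Words seen before the first key are ignored.
    }

    return $parameters;
}

$parsed = parseMessage('stray ALARM_ID/I4=1010001 LOTS/A[1]=[ STEFANO ] RULE/A=x=y');
print_r($parsed);
```

A value word that itself looks like `key=value` is still treated as the start of a new pair; telling those apart would need extra knowledge about valid key names.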
