Parse formatted string with labels in ALL-CAPS followed by their value to generate an associative array

Question

$string = 'Audi MODEL 80 ENGINE 1.9 TDi';
list($make,$model,$engine) = preg_split('/( MODEL | ENGINE )/',$string);

Anything before "MODEL" would be considered "MAKE string".
Anything before "ENGINE" will be considered "MODEL string".
Anything after "ENGINE" is the "ENGINE string".

But we usually have more information in this string.

//  possible variations:
$string = 'Audi MODEL 80 ENGINE 1.9 TDi POWER 90Hk TORQUE 202Nm GEAR man DRIVE 2wd YEAR 1996';

$string = 'Audi MODEL 80 ENGINE 1.9 TDi YEAR 1996 NOTE this engine needs custom stage GEAR auto';    

$string = 'Audi MODEL 80 ENGINE 1.9 TDi GEAR man YEAR 1996';

$string = 'Audi MODEL 80 ENGINE 1.9 TDi YEAR 1996 DRIVE 2wd';

MODEL and ENGINE are always present, and always come first in the string.

The rest (POWER, TORQUE, GEAR, DRIVE, YEAR, NOTE) may vary, both in order and in whether they are present at all.

Since we can't know for sure how the ENGINE string ends, or which of the other keywords comes right after it, I thought it would be possible to create an array of the keywords.
Then do some sort of string search for the first occurrence of a word that matches one of the keywords in the array.

I do need to keep the matched word.

Another way of putting this might be: "How to split the string on/before each occurrence of words in array"

You already have a solution, do you have any problem with it? — shingo, Feb 14 '23 at 15:30

Markus AO · Answer 1 · 2023-02-15T15:34:57.793

To keep the "bits" intact with the keyword included, you can use preg_split with a lookahead that will split on a space followed by any one of your keywords. For example:

$string = 'Audi MODEL 80 ENGINE 1.9 TDi POWER 90Hk TORQUE 202Nm GEAR man DRIVE 2wd YEAR 1996';

$bits = preg_split('~\s+(?=(MODEL|ENGINE|POWER|TORQUE|GEAR|DRIVE|YEAR|NOTE)\b)~', $string);

Results in:

array(8) {
    [0] · string(4) "Audi"
    [1] · string(8) "MODEL 80"
    [2] · string(14) "ENGINE 1.9 TDi"
    [3] · string(10) "POWER 90Hk"
    [4] · string(12) "TORQUE 202Nm"
    [5] · string(8) "GEAR man"
    [6] · string(9) "DRIVE 2wd"
    [7] · string(9) "YEAR 1996"
}

If you want to parse these into key/value pairs, it's simple:

// Initialize array; get the "unnamed" make:
$data = [
    'MAKE' => array_shift($bits),
];

// Iterate any other known keys found:
foreach($bits as $bit) {
    $pair = explode(' ', $bit, 2);
    $data[$pair[0]] = $pair[1];
}

Results in:

array(8) {
    ["MAKE"] · string(4) "Audi"
    ["MODEL"] · string(2) "80"
    ["ENGINE"] · string(7) "1.9 TDi"
    ["POWER"] · string(4) "90Hk"
    ["TORQUE"] · string(5) "202Nm"
    ["GEAR"] · string(3) "man"
    ["DRIVE"] · string(3) "2wd"
    ["YEAR"] · string(4) "1996"
}
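
As a rough sketch of how the two steps above could be combined, the function below builds the lookahead from an array of keywords and tolerates a label that has no value after it. The function name parseByLookahead and the $keywords list are assumptions made for illustration; they are not part of the answer above.

```php
<?php
// Hypothetical helper; $keywords is an assumed list of known labels.
function parseByLookahead(string $string, array $keywords): array
{
    // Escape each keyword so it is matched literally, then join with pipes.
    $alternation = implode('|', array_map(fn($k) => preg_quote($k, '~'), $keywords));

    // Split on whitespace that is immediately followed by a whole keyword.
    $bits = preg_split('~\s+(?=(?:' . $alternation . ')\b)~', trim($string));

    // The first, unlabelled chunk is treated as the make.
    $data = ['MAKE' => array_shift($bits)];

    foreach ($bits as $bit) {
        // Pad to two elements so a label without a value yields an empty string.
        [$key, $value] = array_pad(preg_split('~\s+~', $bit, 2), 2, '');
        $data[$key] = trim($value);
    }

    return $data;
}

$keywords = ['MODEL', 'ENGINE', 'POWER', 'TORQUE', 'GEAR', 'DRIVE', 'YEAR', 'NOTE'];
$result = parseByLookahead('Audi MODEL 80 ENGINE 1.9 TDi YEAR 1996 NOTE', $keywords);
var_export($result);
```

Keeping the keywords in one array means the pattern and the list of accepted labels cannot drift apart.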

mickmackusa · Accepted Answer · 2023-02-15T19:59:37.483

If you'd like to have a dynamic associative array:

1. Prepend MAKE to the string.
2. Use preg_match_all() to capture pairs of labels and values in the formatted string.
3. Use array_column() to restructure the columns of matches into an associative array.

Code:

$strings = [
    'Audi MODEL 80 ENGINE 1.9 TDi POWER 90Hk TORQUE 202Nm GEAR man DRIVE 2wd YEAR 1996',
    'Audi MODEL 80 ENGINE 1.9 TDi YEAR 1996 NOTE this engine needs custom stage GEAR auto',
    'Audi MODEL 80 ENGINE 1.9 TDi GEAR man YEAR 1996',
    'Audi MODEL 80 ENGINE 1.9 TDi YEAR 1996 DRIVE 2wd'
];

foreach ($strings as $string) {
    preg_match_all('/\b([A-Z]+)\s+(\S+(?:\s+\S+)*?)(?=$|\s+[A-Z]+\b)/', 'MAKE ' . $string, $m, PREG_SET_ORDER);
    var_export(array_column($m, 2, 1));
    echo "\n---\n";
}

Output:

array (
  'MAKE' => 'Audi',
  'MODEL' => '80',
  'ENGINE' => '1.9 TDi',
  'POWER' => '90Hk',
  'TORQUE' => '202Nm',
  'GEAR' => 'man',
  'DRIVE' => '2wd',
  'YEAR' => '1996',
)
---
array (
  'MAKE' => 'Audi',
  'MODEL' => '80',
  'ENGINE' => '1.9 TDi',
  'YEAR' => '1996',
  'NOTE' => 'this engine needs custom stage',
  'GEAR' => 'auto',
)
---
array (
  'MAKE' => 'Audi',
  'MODEL' => '80',
  'ENGINE' => '1.9 TDi',
  'GEAR' => 'man',
  'YEAR' => '1996',
)
---
array (
  'MAKE' => 'Audi',
  'MODEL' => '80',
  'ENGINE' => '1.9 TDi',
  'YEAR' => '1996',
  'DRIVE' => '2wd',
)
---

This is not a new concept or technique. The only adjustment to make is how the keys/labels are identified in the original string. Instead of [A-Z]+, you may wish to name each label explicitly and separate the labels in the pattern with pipes.
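
A minimal sketch of that explicit-label variant, assuming the label list below matches your data. The $labels array and the sample string containing "GT" are illustrative assumptions; the pattern is otherwise the same shape as the one in the loop above.

```php
<?php
// Assumed label list; adjust it to the labels your data actually uses.
$labels = ['MAKE', 'MODEL', 'ENGINE', 'POWER', 'TORQUE', 'GEAR', 'DRIVE', 'YEAR', 'NOTE'];
$alt = implode('|', array_map(fn($l) => preg_quote($l, '/'), $labels));

// Same structure as before, with [A-Z]+ swapped for the explicit alternation.
$pattern = '/\b(' . $alt . ')\s+(\S+(?:\s+\S+)*?)(?=$|\s+(?:' . $alt . ')\b)/';

// "GT" is all-caps but not a label, so it stays part of the MODEL value.
$string = 'Audi MODEL 80 GT ENGINE 1.9 TDi YEAR 1996';

preg_match_all($pattern, 'MAKE ' . $string, $m, PREG_SET_ORDER);
$parsed = array_column($m, 2, 1);
var_export($parsed);
```

With [A-Z]+ in place of the alternation, the same input would treat GT as a label of its own.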

Alternatively, instead of using a regex to parse the string, you could manipulate the string into a standardized format that a native PHP function can parse.

foreach ($strings as $string) {
    var_export(
        parse_ini_string(
            preg_replace(
                '~\s*\b(MAKE|MODEL|ENGINE|POWER|TORQUE|GEAR|DRIVE|YEAR|NOTE)\s+~',
                "\n$1=",
                'MAKE ' . $string
            )
        )
    );
    echo "\n---\n";
}
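
To make the intermediate step visible, here is a sketch that keeps the reformatted INI text in a variable before parsing it. Passing INI_SCANNER_RAW is my own assumption and not part of the snippet above; in the default mode PHP interprets bare words such as yes, no, on and off, which may not be what you want for free-text values.

```php
<?php
$string = 'Audi MODEL 80 ENGINE 1.9 TDi GEAR man YEAR 1996';

// Step 1: turn each label and its surrounding whitespace into "\nLABEL=".
$ini = preg_replace(
    '~\s*\b(MAKE|MODEL|ENGINE|POWER|TORQUE|GEAR|DRIVE|YEAR|NOTE)\s+~',
    "\n$1=",
    'MAKE ' . $string
);
// $ini now holds one "LABEL=value" pair per line, preceded by a blank line.

// Step 2: parse the text as INI. INI_SCANNER_RAW (an assumption here)
// keeps every value as the literal string found after the "=".
$data = parse_ini_string($ini, false, INI_SCANNER_RAW);
var_export($data);
```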

Looks like this got me exactly what I needed. I put the keywords in their own array and replaced `[A-Z]+` with `['.implode('|',$keywords).']+`. This returned an array with the keywords as the keys and the following strings as the values. — ThomasK, Feb 15 '23 at 08:02
No. Be careful. In regex, square braced subpatterns denote a "character class" or a list of whitelisted characters. You want a parenthetical subpattern to express a "capture group". See this amended demo: https://3v4l.org/eWqdq — mickmackusa, Feb 15 '23 at 09:12
Yep. I wasn't getting the expected result when other capital letters matched. I was about to write a follow-up question here, but you already gave me the answer. So thanks again — ThomasK, Feb 15 '23 at 12:39
Explicit keys for the capture will be "safer" than `([A-Z]+)\s+` for the "any key" match, where values may have all-caps words, as is often the case with car models (`GT`, `GR`, etc., incl. many single capital letters). — Markus AO, Feb 15 '23 at 15:42
@Thomas we always prefer diverse/challenging sample data in regex-related questions because they help to identify the variability of your actual project data. The `[A-Z]` regex works perfectly on the supplied data, but is evidently not robust enough for real data. Researchers may have the slightly different challenge of not knowing all possible labels in advance. I tried to make my answer cover both scenarios. Thanks for the feedback. — mickmackusa, Feb 15 '23 at 20:07
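
The character-class pitfall raised in these comments can be seen in a small sketch. The two patterns and the test words are made up for illustration; only the square-bracket versus parenthesis distinction comes from the discussion.

```php
<?php
// Character class: matches any run of the listed characters (the letters
// M, O, D, E, L, N, G, I, A, R and the pipe itself), in any order.
$classPattern = '/^[MODEL|ENGINE|GEAR]+$/';

// Group with alternation: matches only one of the whole words.
$groupPattern = '/^(?:MODEL|ENGINE|GEAR)$/';

foreach (['GEAR', 'GR', 'LEG'] as $word) {
    printf(
        "%-4s class: %d  group: %d\n",
        $word,
        preg_match($classPattern, $word),
        preg_match($groupPattern, $word)
    );
}
```

Only GEAR satisfies the group pattern, while all three words satisfy the character class, which is why all-caps values such as GR end up being treated as labels.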

score 0 · Answer 3 · answered Feb 14 '23 at 18:35

If you'd prefer a non-RegEx method, you could also just break the string into individual tokens (words) and build an array. The code below assumes the words are separated by single spaces; if that is a problem, it could be addressed by normalizing the whitespace with a replace first.

// The first group to assign un-prefixed items to
$firstGroup = 'MAKE';

// Every possible word grouping
$wordList = ['ENGINE', 'MODEL', 'POWER', 'TORQUE', 'GEAR', 'DRIVE', 'YEAR', 'NOTE'];

// Test string
$string = 'Audi MODEL 80 ENGINE 1.9 TDi POWER 90Hk TORQUE 202Nm GEAR man DRIVE 2wd YEAR 1996';

// Key/value of group name and values
$groups = [];

// Default to the first group
$currentWord = $firstGroup;
foreach (explode(' ', $string) as $word) {

    // Found a special word, reset and continue the hunt
    if (in_array($word, $wordList)) {
        $currentWord = $word;
        continue;
    }

    // Assign. The foreach loop below could be removed by doing string concatenation here instead
    $groups[$currentWord][] = $word;
}

// Optional, join each back into a string
foreach ($groups as $key => $values) {
    $groups[$key] = implode(' ', $values);
}

var_dump($groups);

Outputs:

array(8) {
  ["MAKE"]=>
  string(4) "Audi"
  ["MODEL"]=>
  string(2) "80"
  ["ENGINE"]=>
  string(7) "1.9 TDi"
  ["POWER"]=>
  string(4) "90Hk"
  ["TORQUE"]=>
  string(5) "202Nm"
  ["GEAR"]=>
  string(3) "man"
  ["DRIVE"]=>
  string(3) "2wd"
  ["YEAR"]=>
  string(4) "1996"
}

Demo: https://3v4l.org/D4pvl
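
Both caveats mentioned in this answer (the whitespace assumption and the second loop) could be handled roughly as follows. This is only a sketch: preg_split() is swapped in for explode() here, which moves slightly away from the strictly non-regex approach, and the sample string with a double space and a tab is made up for illustration.

```php
<?php
// Assumed label list, including NOTE from the question.
$wordList = ['MODEL', 'ENGINE', 'POWER', 'TORQUE', 'GEAR', 'DRIVE', 'YEAR', 'NOTE'];
$string = "Audi  MODEL 80\tENGINE 1.9 TDi YEAR 1996 NOTE this engine needs custom stage";

$groups = [];
$current = 'MAKE';

// Split on any run of whitespace and drop empty tokens.
foreach (preg_split('/\s+/', $string, -1, PREG_SPLIT_NO_EMPTY) as $word) {
    if (in_array($word, $wordList, true)) {
        $current = $word;
        $groups[$current] = '';   // register the label even if no value follows
        continue;
    }
    // Concatenate directly, so no second loop with implode() is needed.
    $groups[$current] = isset($groups[$current]) && $groups[$current] !== ''
        ? $groups[$current] . ' ' . $word
        : $word;
}

var_export($groups);
```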
