How to parse a mostly consistent filename into meaningful parts?

Question

I have filenames like:

1234_56_78 A_FAIRLY_SHORT_TITLE_D.pdf

Luckily, the file naming is pretty consistent, but I can't absolutely guarantee that someone didn't use a space where they should have used an underscore.

With this in mind, I want to parse the string and extract the following details:

$project_no = '1234'
$series_no = '56'
$sheet_no = '78'
$revision = 'D'
$title = 'A Fairly Short Title'

Presently, I use the following to grab this info:

$filename = $_FILES['file']['name'][$i];
$filename = preg_replace('/\\.[^.\\s]{3,4}$/', '', $filename);
$parts = preg_split( "(_| )", $filename );
$project_no = $parts[0];
$series_no = $parts[1];
$sheet_no = $parts[2];
$revision = end($parts);

$title is simply everything that's left over after removing $parts[0], $parts[1], $parts[2], and end($parts), but how should I express that?

I thought I might be able to use

$title = implode(' ',\array_diff_key($parts, [0,1,2,end($parts)]));

But this doesn't remove the $revision bit at the end...

$title = FLOOR AS PROPOSED D

What am I missing, and am I unnecessarily over-complicating this?

[array_pop()](https://www.php.net/array_pop) and [array_shift()](https://www.php.net/array_shift) are your friends. — Matt Raines, Oct 31 '20 at 20:44

nice_dev · Accepted Answer · 2020-11-01T05:15:12.113

array_diff_key() compares the keys of the two arrays, not their values. end() moves the array's internal pointer and returns the value of the last element (here 'D'). Inside your second array, that value simply sits at key 3, so it does nothing to exclude the last key of $parts.

The current comparison behaves like

array_diff_key([0,1,2,3,4,5,6,7], [0,1,2,'D'])

which looks key wise as:

   array_diff_key([0,1,2,3,4,5,6,7], [0,1,2,3])

Hence, the result of implode() is the concatenation of the values at keys 4, 5, 6 and 7.
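To see the key-based comparison in action, here is a minimal sketch. The $parts array is a hypothetical sample, assumed to be what splitting the question's example filename produces. Note that the first word of the title, which sits at key 3, is dropped as well.

```php
<?php
// Hypothetical sample: the parts produced by splitting the question's filename.
$parts = ['1234', '56', '78', 'A', 'FAIRLY', 'SHORT', 'TITLE', 'D'];

// The second argument is a plain list, so its keys are simply 0, 1, 2, 3;
// 'D' is only a value stored at key 3.
$exclude = [0, 1, 2, end($parts)];

// Keys 0-3 are removed from $parts; keys 4-7 survive, including the revision.
$remaining = array_diff_key($parts, $exclude);
echo implode(' ', $remaining), PHP_EOL; // prints "FAIRLY SHORT TITLE D"
```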

To have the values of the second array treated as keys, wrap it in array_flip(), which swaps keys and values, and pass the last index (count($parts) - 1) instead of the last value:

$title = implode(' ',\array_diff_key($parts, array_flip([0,1,2,count($parts)-1])));

Demo: https://3v4l.org/J6b5r
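If the goal is only "everything between the third part and the last part", array_slice() with a negative length may be a simpler way to express it. This is an alternative sketch rather than the approach used above, and $parts is again a hypothetical sample of the split filename.

```php
<?php
// Hypothetical sample parts, as the question's preg_split() call would produce them.
$parts = ['1234', '56', '78', 'A', 'FAIRLY', 'SHORT', 'TITLE', 'D'];

// Start at offset 3 and stop one element before the end (negative length).
$title = implode(' ', array_slice($parts, 3, -1));
echo $title, PHP_EOL; // prints "A FAIRLY SHORT TITLE"
```

A negative third argument tells array_slice() to stop that many elements from the end, so no key bookkeeping is needed.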

score 1 · Answer 2 · answered Dec 05 '20 at 12:36

I fear you are over-complicating this. I think a single preg_match() call is the most direct way to parse your string.

It looks like you grabbed the regex pattern from https://stackoverflow.com/a/2395905/2943403 to trim the extension from your filename; however, I recommend against using a regex function when a single non-regex function serves the same purpose.

pathinfo($filename, PATHINFO_FILENAME)

Now that the extension has been removed, let's move on to the parsing.

Code: (Demo)

$filename = '1234_56_78 A_FAIRLY_SHORT_TITLE_D.pdf';
preg_match('~([^ _]+)[ _]([^ _]+)[ _]([^ _]+)[ _](.+)[ _](\S)~', pathinfo($filename, PATHINFO_FILENAME), $m);

var_export([
    'project_no' => $m[1],
    'series_no' => $m[2],
    'sheet_no' => $m[3],
    'title' => str_replace('_', ' ', $m[4]),
    'revision' => $m[5],
]);

Output:

array (
  'project_no' => '1234',
  'series_no' => '56',
  'sheet_no' => '78',
  'title' => 'A FAIRLY SHORT TITLE',
  'revision' => 'D',
)
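Because the question admits that the naming is only "mostly" consistent, it may be worth checking the return value of preg_match() before reading $m. The sketch below wraps the same pattern in a function; parseSheetFilename() is a hypothetical name, and the added ^ and $ anchors assume the revision is always a single character at the very end of the name.

```php
<?php
// parseSheetFilename() is a hypothetical helper name, not part of the original answer.
function parseSheetFilename(string $filename): ?array
{
    $base = pathinfo($filename, PATHINFO_FILENAME);

    // Same pattern as above, anchored with ^ and $ (an assumption) so that
    // names with a multi-character final segment are rejected, not truncated.
    $pattern = '~^([^ _]+)[ _]([^ _]+)[ _]([^ _]+)[ _](.+)[ _](\S)$~';
    if (preg_match($pattern, $base, $m) !== 1) {
        return null; // the name does not follow the expected layout
    }

    return [
        'project_no' => $m[1],
        'series_no'  => $m[2],
        'sheet_no'   => $m[3],
        'title'      => str_replace('_', ' ', $m[4]),
        'revision'   => $m[5],
    ];
}

var_export(parseSheetFilename('1234_56_78 A_FAIRLY_SHORT_TITLE_D.pdf'));
var_export(parseSheetFilename('notes.pdf')); // prints NULL
```

Returning null for a non-matching name lets the calling code decide whether to skip the upload or report it, instead of hitting undefined offsets in $m.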

If you are dead set on using preg_split(), then the pattern becomes very simple, but there is a little more mopping up to do.

Code: (Demo)

$filename = '1234_56_78 A_FAIRLY_SHORT_TITLE_D.pdf';
$m = preg_split('~ |_~', pathinfo($filename, PATHINFO_FILENAME));
$revision = array_pop($m);

var_export([
    'project_no' => $m[0],
    'series_no' => $m[1],
    'sheet_no' => $m[2],
    'title' => implode(' ', array_slice($m, 3)),
    'revision' => $revision,
]);
// same output as earlier snippet
