1

I am adding a feature to an application that allows authorised oil rig personnel to submit weather reports (for use by our pilots when planning flights) to our system via email. The tricky part is that we want to match these reports to a particular oil platform, but the personnel (and their email accounts) can move between rigs.

We already have a list of waypoints that each have an "aliases" field. Basically if the email subject contains something in the aliases field, we should match the email to that waypoint.

The subject could be "Weather report 10 April @ 1100 Rig A for you as requested"

The aliases for that waypoint would be something like "RRA RPA Rig A RigA"

Keep in mind there is a similar list of aliases for all the other waypoints we have.

Is there a better way of matching than iterating through each word of each alias and checking if it's a substring of the email subject? Because that sounds like a n^2 sort of problem.

The alternative is for us to put a restriction and tell the operators they have to put the rig name at the start or end of the subject.

  • Convert the list to a regular expression `/RRA|RPA|Rig A|RigA/` then do a regular expression match. – Barmar Apr 17 '20 at 23:03
  • The PCRE engine will do a better job of optimizing the work than anyone with less than a masters in CompSci. Use [`preg_quote()`](https://www.php.net/manual/en/function.preg-quote.php) on each item in the list, and then `implode()` it to create the expression. – Sammitch Apr 17 '20 at 23:21
  • Although enforcing a standard format for requests is probably going to serve you better in the long run. Trying to play whack-a-mole with every bizarre thing a user decides to type is never a productive use of anyone's time. – Sammitch Apr 17 '20 at 23:22
  • By enforcing a standard format we could always fallback to not matching to a waypoint but letting it through regardless, which isn’t quite as nice but I could have convinced the bosses that was ok. I like the regex trick, I was hoping there was something like that I didn’t know about! – Josh Pirihi Apr 18 '20 at 00:31

1 Answers1

0

This sounds more like an algorithms question than a PHP question specifically. Take a look at What is the fastest substring search algorithm?

Well you can transform this into something like an O(n log n) algorithm, but it depends on the implementation specifics of stripos():

define('RIG_ID_1', 123);
define('RIG_ID_2', 456);

function get_rig_id($email_subject) {
    $alias_map = [
        'RRA' => RIG_ID_1,
        'RPA' => RIG_ID_1,
        'Rig A' => RIG_ID_1,
        'RigA' => RIG_ID_1,
        // ...
    ];
    foreach(array_keys($alias_map) as $rig_substr) {
        if(stripos($email_subject, $rig_substr) !== false) {
            return $alias_map[$rig_substr];
        }
    }
    return null;
}

Here each substring is examined by stripos() exactly once. Probably a better solution is to compose these strings into a series of regexes. Internally, the regex engine is able to scan text very efficiently, typically scanning each character only one time:

ex.:

<?php

define('RIG_ID_1', 123);
define('RIG_ID_2', 456);

function get_rig_id($email_subject) {
    $alias_map = [
        '/RRA|RPA|Rig\\sA|RigA/i' => RIG_ID_1,
        '/RRB|RPB|Rig\\sB|RigB/i' => RIG_ID_2,
        // ...
    ];
    foreach(array_keys($alias_map) as $rig_regex) {
        if(preg_match($rig_regex, $email_subject)) {
            return $alias_map[$rig_regex];
        }
    }
    return null;
}

For your purposes the practical solution is very much dependent upon how many rigs you've got an how many substrings per rig. I suspect that unless you're dealing with tens of thousands of rigs or unless performance is a critical aspect of this application, a naive O(n^2) solution would probably suffice. (Remember that premature optimization is the root of all evil!) A simple benchmark would bear this out.

An even-better solution -- and potentially faster -- would be to set up an elasticsearch instance, but once again that may be too much effort to go to when a naive approach would suffice in a fraction of the implementation time.

Max
  • 913
  • 1
  • 7
  • 18