0

I have a large database (over 2 million rows) where I store statistics from users. Users have Unique IDs which contain parts (for example hashed MAC address). The example Unique ID string is:

"AAAAAA-BBBBBB-CCCCCC-DDDDDD-EEEEEE-FFFFFFF"

Sometimes a part of the Unique ID changes and then it's for example:

"ZZZZZZ-BBBBBB-CCCCCC-DDDDDD-EEEEEE-XXXXXXX"

I want to identify users and select all rows, where at least 3 (or different value) Unique ID parts match, using PHP. Is there any useful syntax which I could use to do that? I know that I can select all rows containing at least 1 Unique ID part, then explode to an array by the "-" character and check if contains at least 3 same words, but I believe that it isn't the fastest possible way.

Basically, I would like to use something like:

WHERE `UniqueID` REGEXP 'AAAAAA|BBBBBB|CCCCCC|DDDDDD|EEEEEE|FFFFFFF';

But selecting only if 3 of 6 words match.

Mona
  • 337
  • 3
  • 15
  • 1
    Your unique id should be split into multiple columns, then this would be much simpler. Also, do the positions of the matches have to match? – Gordon Linoff Apr 13 '17 at 12:30
  • if you have always the same type of data 'credit card style with letters look-alike' you might be interested in [THIS](http://stackoverflow.com/questions/17392197/selecting-part-of-a-field-with-a-regex) or [THIS](http://stackoverflow.com/questions/1326063/using-columns-in-a-regexp-in-mysql) and [THIS](http://stackoverflow.com/questions/9315647/regex-credit-card-number-tests) – OldPadawan Apr 13 '17 at 12:33
  • Thanks for answers! Unfortunately, the columns can contain "unique id parts" not in the same order. Sometimes the Unique ID can be shorter (for example 4 instead of 6 parts), etc. Those parts are usually hashed hardware information (numbers in random length, maximum 12 characters long). – Mona Apr 13 '17 at 12:39
  • 'AAAAAA|BBBBBB|CCCCCC|DDDDDD|EEEEEE|FFFFFFF' is equivalent to like a or like b or like c or like d or like e **or** like f. but you want at least three matches regardless of their position?? – Krish Apr 13 '17 at 12:58
  • Yes @krishKM, that's what I need. – Mona Apr 13 '17 at 13:25

2 Answers2

1

YOU MIGHT WANT TO CHECK PERFORMANCE

Consider this query which uses inStr to check if a string is available within your id string. i used substring_index to extract the parts. just in case if you are interested in splitting or extracting part of the id string to create a temporary table but for this answer you can ignore it.

Ideally create your own stored procedure with keyword1, keyword2 & keyword3 as in parameter and then you can perform the search and return results.

Where condition checks whether the keyword1, 2 & 3 are found in your id string.

select 
    'ZZZZZZ-BBBBBB-CCCCCC-DDDDDD-EEEEEE-XXXXXXX' as Id,
    substring_index('ZZZZZZ-BBBBBB-CCCCCC-DDDDDD-EEEEEE-XXXXXXX','-', 1) as part1,
    substring_index(substring_index('ZZZZZZ-BBBBBB-CCCCCC-DDDDDD-EEEEEE-XXXXXXX','-', 2),'-', -1) as part2,
    substring_index(substring_index('ZZZZZZ-BBBBBB-CCCCCC-DDDDDD-EEEEEE-XXXXXXX','-', 3),'-', -1) as part3,
    substring_index(substring_index('ZZZZZZ-BBBBBB-CCCCCC-DDDDDD-EEEEEE-XXXXXXX','-', 4),'-', -1) as part4,
    substring_index(substring_index('ZZZZZZ-BBBBBB-CCCCCC-DDDDDD-EEEEEE-XXXXXXX','-', 5),'-', -1) as part5,
    substring_index(substring_index('ZZZZZZ-BBBBBB-CCCCCC-DDDDDD-EEEEEE-XXXXXXX','-', 6),'-', -1) as part6
from dual
WHERE

    instr('ZZZZZZ-BBBBBB-CCCCCC-DDDDDD-EEEEEE-XXXXXXX','BBBBBB') >= 1 -- keyword1
    and instr('ZZZZZZ-BBBBBB-CCCCCC-DDDDDD-EEEEEE-XXXXXXX','CCCCCC') >= 1 -- keyword2
    and instr('ZZZZZZ-BBBBBB-CCCCCC-DDDDDD-EEEEEE-XXXXXXX','DDDDDD') >= 1 -- keyword3
;

EDIT if the above query is working you can add your logic.

select 
    'ZZZZZZ-BBBBBB-CCCCCC-DDDDDD-EEEEEE-XXXXXXX' as Id


from dual
WHERE
    (
    -- below logic gives true when 3 or more keywords are found. change this accordingly
        (instr('ZZZZZZ-BBBBBB-CCCCCC-DDDDDD-EEEEEE-XXXXXXX',ik1) >0)+
        (instr('ZZZZZZ-BBBBBB-CCCCCC-DDDDDD-EEEEEE-XXXXXXX',ik2) >0)+
        (instr('ZZZZZZ-BBBBBB-CCCCCC-DDDDDD-EEEEEE-XXXXXXX',ik3) >0)+
        (instr('ZZZZZZ-BBBBBB-CCCCCC-DDDDDD-EEEEEE-XXXXXXX',ik4) >0)+
        (instr('ZZZZZZ-BBBBBB-CCCCCC-DDDDDD-EEEEEE-XXXXXXX',ik5) >0)+
        (instr('ZZZZZZ-BBBBBB-CCCCCC-DDDDDD-EEEEEE-XXXXXXX',ik6) >0)
    ) >=3

;
Krish
  • 5,917
  • 2
  • 14
  • 35
  • Thanks for your time. I tested it and it takes 5 seconds to select from ~2 mln rows database. However, I don't understand one thing: the query should search for 6 keywords, and success only when 3 or more matches. When I add an additional "and instr(`UniqueID`,'xxxxxxxxx') >= 1 -- keyword4" the query doesn't work. So I guess I need to use "OR" but select only if there are 3 matches... – Mona Apr 13 '17 at 13:53
  • Awesome! Thank you for your help. I'll post how I did it using PHP in a separated "answer". – Mona Apr 13 '17 at 14:49
0

Thanks to @krishKM I found the best solution for my Unique IDs selection using PHP.

First I made a function which puts Unique ID parts to an array, but keeps "delimeters" on both sides. It helps to prevent incorrect selections, for example when the ID part is very short.

function SplitUniqueIDKeepDelimeters($UniqueID, $Delimeter = "-"){
    $UniqueIDParts = array();
    $Remaining = $UniqueID;
    $pos = strpos($Remaining, $Delimeter);
    $firstPart = true;
    while ($pos !== false) {
        $Part = substr($Remaining,0,$pos + 1);
        if(!$firstPart){
            $Part = $Delimeter.$Part;
        }
        $firstPart = false;
        $UniqueIDParts[] = $Part;
        $Remaining = substr($Remaining,$pos+1);
        $pos = strpos($Remaining, $Delimeter);
    }
    if(!$firstPart){
        $Remaining = $Delimeter.$Remaining;
    }
    $UniqueIDParts[] = $Remaining;
    return $UniqueIDParts;
}

Then based on @krishKM's answer I created the selection syntax:

function GenerateSelectingString($UniqueIDParts, $RowName = "UniqueID", $HowManyMatches = 3){
    $First = true;
    $FinishedString = " (";
    foreach ($UniqueIDParts as $key => $value){
        if(!$First){
            $FinishedString .= "+";
        }
        $First = false;
        $FinishedString .= "(instr(`".$RowName."`,'".$value."') > 0)";

    }
    $FinishedString .= ") >=".$HowManyMatches;
    return $FinishedString;
}

Test:

$TestArray = SplitUniqueIDKeepDelimeters("21680-7886-2761736077-72316693-753974682-041548389245");
$TestSelectString = GenerateSelectingString($TestArray, "UniqueID");

echo "Unique ID parts array:<br>";
var_dump($TestArray);
echo "<br><br>Query String:<br>";
echo $TestSelectString;
echo "<br>";

$sql2 = "SELECT * FROM `Statistics`.`Statistics` WHERE ".$TestSelectString. " order by `lognumber` DESC;";
    $result2 = mysqli_query($conn, $sql2);
    if (mysqli_num_rows($result2) > 0) {
    while($row2 = mysqli_fetch_assoc($result2)) {
    echo "<br>";
    var_dump($row2);
    echo "<br>";
    }
}
else{
    echo "<br>".mysqli_errno($conn) . ": " . mysqli_error($conn) . "\n";
}

Output:

Unique ID parts array: array(6) { [0]=> string(6) "21680-" [1]=> string(6) "-7886-" [2]=> string(12) "-2761736077-" [3]=> string(10) "-72316693-" [4]=> string(11) "-753974682-" [5]=> string(13) "-041548389245" } 

Query String: ((instr(`UniqueID`,'21680-') > 0)+(instr(`UniqueID`,'-7886-') > 0)+(instr(`UniqueID`,'-2761736077-')
> 0)+(instr(`UniqueID`,'-72316693-') > 0)+(instr(`UniqueID`,'-753974682-') > 0)+(instr(`UniqueID`,'-041548389245') > 0)) >=3

array(60) {  ...
Mona
  • 337
  • 3
  • 15