
I have a web app where I allow users to upload an Excel file with a list of people in it. I have a PHP script that checks whether the file's headers are correct, and if so, I do some error checking. The data in the Excel file essentially becomes a 2D array. I check for duplicate entries, where two rows are considered duplicates if their given name and birth date are the same. I also check whether the values at certain columns/array indexes follow certain formats and values. If the values are correct, the data are saved in the database.

If there are duplicates, I simply show the user that a certain John Doe or Jane Doe has duplicates at the row number(s) where the script found them. Similarly, if there are incorrect entries in the columns, I show the user that John Doe's data is incorrect.

This already works; however, when I try to upload larger Excel files, the script halts at the part where I check for duplicate rows. Admittedly, the way I do it is not the best, since it's an O(n^2) algorithm: I have a nested for loop where the outer loop goes through each row, and the inner loop goes through every subsequent row and compares it to the outer loop's row to check whether it's a duplicate. So of course, given a larger file, the script just eats up the memory allocated to it.
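For reference, the nested-loop check described above looks roughly like this (a sketch; function and variable names are assumptions, and the layout with the given name in column 0 and the birth date in column 2 is assumed):

```php
<?php
// Sketch of the O(n^2) duplicate check described above (names assumed).
// $rows is the 2D array parsed from the Excel file; columns 0 and 2 are
// assumed to hold the given name and birth date respectively.
function find_duplicates_quadratic(array $rows): array {
    $duplicates = array();
    for ($i = 0; $i < count($rows); $i++) {
        for ($j = $i + 1; $j < count($rows); $j++) {
            // A duplicate = same given name and same birth date.
            if ($rows[$i][0] === $rows[$j][0] && $rows[$i][2] === $rows[$j][2]) {
                $duplicates[] = array($i, $j);
            }
        }
    }
    return $duplicates;
}
```

Every row is compared against every later row, which is what makes the running time grow quadratically with the file size.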

I'm pretty sure I could just increase the memory allocated to PHP to make the algorithm finish, but that's not really solving the problem, and the script would still take longer.

I've already searched how to remove duplicates in a 2D array and I have this function:

function check_dupe_values_new($array, $single_indices, $combined_indices){

    $sizeOfArray = sizeof($array);
    echo "array size = $sizeOfArray<br>";

    // Approach 1: serialize each row, drop duplicate strings, unserialize back.
    $start_1 = microtime(true);
    $unique = array_map('unserialize', array_unique(array_map('serialize', $array)));
    $total_time_seconds_1 = microtime(true) - $start_1;

    $sizeOfUnique = sizeof($unique);

    echo "size of unique = $sizeOfUnique<br>";
    echo "time elapsed = $total_time_seconds_1<br><br>";

    // Approach 2: serialize and deduplicate, then map the surviving keys
    // back to the original rows (no unserialize needed).
    $start_2 = microtime(true);
    $serialized = array_map('serialize', $array);
    $unique = array_unique($serialized);
    $resulting_array = array_intersect_key($array, $unique);
    $total_time_seconds_2 = microtime(true) - $start_2;

    $size_of_2nd_algo = sizeof($resulting_array);

    echo "size of new second unique array = $size_of_2nd_algo<br>";
    echo "time elapsed = $total_time_seconds_2<br><br>";

}

They basically do the same thing, but the second one is faster, since it skips the unserialize step.

These two work well; however, I think they compare the entire data row. I've also seen this one. However, I think it only checks for uniqueness of a single index, not multiple indices like in my situation.
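For what it's worth, the serialize-based approach can be limited to selected columns by projecting each row onto those columns before serializing. A minimal sketch (the function name and the choice of columns 0-2 are assumptions):

```php
<?php
// Sketch: restrict the uniqueness check to selected column indices.
function unique_rows_by_columns(array $rows, array $indices): array {
    $keys = array_map(function ($row) use ($indices) {
        // Build a composite key from only the chosen columns.
        $parts = array();
        foreach ($indices as $i) {
            $parts[] = $row[$i];
        }
        return serialize($parts);
    }, $rows);
    // array_unique keeps the first occurrence of each key;
    // array_intersect_key maps those keys back to the full rows.
    return array_intersect_key($rows, array_unique($keys));
}

$rows = array(
    array('John', 'Doe', '1997-04-23', 'x'),
    array('John', 'Doe', '1997-04-23', 'y'), // duplicate on columns 0-2
    array('Jane', 'Doe', '1994-01-15', 'z'),
);
print_r(unique_rows_by_columns($rows, array(0, 1, 2))); // keeps rows 0 and 2
```

This keeps the single-pass cost of the array_unique approach while ignoring the columns that shouldn't count toward row identity.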

So, my questions now are:

  1. Is there a way to check for unique data rows in a 2D array given a certain set of column indices to check? (without having to resort to a nested for loop)
  2. Is there a way to identify where the said duplicates occur (if any)? (without having to resort to a nested for loop)
Razgriz
    I'd start by sorting both arrays into Date | Name order. Then it should be possible to match duplicates using a single pass through each array. – Tangentially Perpendicular Jan 30 '21 at 11:04
  • @TangentiallyPerpendicular Sorry but there's only one 2D array, but I think that's a good start. I'll have to fix the dates though, because they're not all in the same format, then after that I can do your suggested sort method and then just check which ones are the same. – Razgriz Jan 30 '21 at 14:08
  • Related: [Filter/Remove rows where column value is found more than once in a multidimensional array](https://stackoverflow.com/q/45603614/2943403) – mickmackusa Dec 23 '22 at 21:07

1 Answer


The approach of the answer php multi-dimensional array remove duplicate you have found is correct. You have to adjust the lookup key for the duplicate check: instead of using only one column, you build a key that represents all the values you want to use to identify a unique row. Check the following example:

<?php
$data = array();
$data[] = array('John', 'Doe', '1997-04-23', 'some');
$data[] = array('Jane', 'Smith', '1994-01-15', 'other');
$data[] = array('John', 'Doe', '1991-10-05', 'data');
$data[] = array('Jane', 'Smith', '1994-01-15', 'to');
$data[] = array('John', 'Oliver', '1997-04-23', 'store');

function buildKeyForDuplicateIdentification($row) {
    return $row[0].'|'.$row[1].'|'.$row[2];
}

$duplicateInfo = array();
for ($i=0; $i<count($data); $i++) {
    $currentRow = $data[$i];
    $key = buildKeyForDuplicateIdentification($currentRow);
    if (isset($duplicateInfo[$key])) {
        echo 'The row at index '.$i.' uses the same identifier '.
             'columns as the row at index '.$duplicateInfo[$key]."\n";
        die();
    }
    $duplicateInfo[$key] = $i;
}
echo "No duplicates found\n";

This will generate the following output:

The row at index 3 uses the same identifier columns as the row at index 1

You can adjust the buildKeyForDuplicateIdentification() function to build whatever lookup key you want. You can also change the $duplicateInfo array to store an array of row indices per key, instead of only one index, so you can find all duplicates of the same key. And you can remove the die() call and collect all the duplicate info first, instead of stopping at the first duplicate.
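That last variant could be sketched like this (the function name findDuplicateGroups is an assumption; the key columns match the example above):

```php
<?php
// Sketch: collect ALL duplicate groups instead of stopping at the first one.
function findDuplicateGroups(array $data): array {
    $seen = array();
    foreach ($data as $i => $row) {
        // Same composite key as above: given name, surname, birth date.
        $key = $row[0].'|'.$row[1].'|'.$row[2];
        $seen[$key][] = $i;
    }
    // Keep only keys that occur in more than one row.
    return array_filter($seen, function ($indices) {
        return count($indices) > 1;
    });
}

$data = array(
    array('John', 'Doe', '1997-04-23', 'some'),
    array('Jane', 'Smith', '1994-01-15', 'other'),
    array('John', 'Doe', '1991-10-05', 'data'),
    array('Jane', 'Smith', '1994-01-15', 'to'),
    array('John', 'Oliver', '1997-04-23', 'store'),
);
foreach (findDuplicateGroups($data) as $key => $indices) {
    echo $key.' appears at rows '.implode(', ', $indices)."\n";
}
// Prints: Jane|Smith|1994-01-15 appears at rows 1, 3
```

This stays a single O(n) pass over the data and gives the user every duplicate group with all of its row indices at once.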

Progman