I have a web app where users can upload an Excel file with a list of people in it. A PHP script checks whether the file's headers are correct, and if so, runs some error checking. The data in the Excel file essentially becomes a 2D array. I check for duplicate entries, where two rows are considered duplicates if their given name and birth date are the same. I also check whether the values at certain columns/array indexes follow certain formats and allowed values. If everything is correct, the data is saved to the database.
If there are duplicates, I simply show the user that a certain John Doe or Jane Doe is duplicated, along with the row number(s) where the script found the duplicate(s). Similarly, if a column value is entered incorrectly, I show the user that John Doe's data is incorrect.
This already works; however, when I try to upload larger Excel files, the script halts at the part where I check for duplicate rows. Admittedly, the way I do it is not the best, since it's an O(n^2) algorithm: a nested for loop where the outer loop goes through each row, and the inner loop goes through every following row and compares it to the outer row to check whether it's a duplicate. So of course, given a larger file, the script just eats up the memory allocated for it.
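For reference, this is roughly what my current check looks like (simplified; the real column indices differ, 1 for given name and 3 for birth date are just placeholders here):

// Nested-loop duplicate check (simplified sketch of what I do now).
// $rows is the 2D array from the Excel file; columns 1 and 3 are
// placeholder indices for the given name and birth date columns.
function find_dupes_nested($rows) {
    $dupes = [];
    $n = count($rows);
    for ($i = 0; $i < $n; $i++) {
        for ($j = $i + 1; $j < $n; $j++) {
            // a duplicate = same given name and same birth date
            if ($rows[$i][1] === $rows[$j][1] && $rows[$i][3] === $rows[$j][3]) {
                $dupes[] = [$i, $j]; // remember which rows collided
            }
        }
    }
    return $dupes;
}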
I'm pretty sure I could just increase the memory allocated to PHP to make the algorithm run, but that doesn't really solve the problem, and the script would still take just as long.
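(For completeness, raising the limits would just be something like the following at the top of the upload script, which is exactly the band-aid I'd like to avoid:)

// Band-aid I'd rather avoid: just give the script more memory and time.
ini_set('memory_limit', '512M');
set_time_limit(300);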
I've already searched for how to remove duplicates from a 2D array and came up with this function:
function check_dupe_values_new($array, $single_indices, $combined_indices){
    $sizeOfArray = sizeof($array);
    echo "array size = $sizeOfArray<br>";

    // 1st approach: serialize every row, let array_unique drop duplicates,
    // then unserialize the survivors back into arrays
    $start_1 = microtime(true);
    $unique = array_map('unserialize', array_unique(array_map('serialize', $array)));
    $total_time_seconds_1 = microtime(true) - $start_1;
    $sizeOfUnique = sizeof($unique);
    echo "size of unique = $sizeOfUnique<br>";
    echo "time elapsed = $total_time_seconds_1<br><br>";

    // 2nd approach: dedupe the serialized strings, then keep only the
    // original rows whose keys survived (avoids unserializing everything)
    $start_2 = microtime(true);
    $serialized = array_map('serialize', $array);
    $unique = array_unique($serialized);
    $resulting_array = array_intersect_key($array, $unique);
    $total_time_seconds_2 = microtime(true) - $start_2;
    $size_of_2nd_algo = sizeof($resulting_array);
    echo "size of new second unique array = $size_of_2nd_algo<br>";
    echo "time elapsed = $total_time_seconds_2<br><br>";
}
Both approaches do basically the same thing, but the second one is faster.
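For what it's worth, I call it roughly like this, where $rows is the parsed sheet and the index arguments are the columns I eventually want the check restricted to (they aren't actually used inside the function yet):

// Hypothetical call; 1 = given name column and 3 = birth date column
// are placeholders for the real indices in my sheet.
check_dupe_values_new($rows, [1, 3], [1, 3]);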
These two work well; however, I think they compare the entire data row. I've also seen this one. However, I think it only checks for uniqueness of a single index, not multiple indices like in my situation.
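That is, a check along these lines, which only looks at one column at a time:

// Single-column uniqueness check (roughly what the linked approach does,
// as I understand it): pull out one column and count the distinct values.
$names = array_column($rows, 1);      // 1 = given name column (placeholder)
$unique_names = array_unique($names);
$has_dupes = count($unique_names) < count($names);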
So, my questions now are:
- Is there a way to check for unique data rows in a 2D array given a certain set of column indices to check? (without having to resort to a nested for loop)
- Is there a way to identify where the said duplicates occur (if any)? (without having to resort to a nested for loop)