0

We use a PHP script to read data from an API. For each returned record, we compare a field from that record with data in a MSSQL database. Different logic is implemented depending on whether or not these two fields (strings) match.

Matches are case-insensitive and we also do not care about trailing/leading white space.

//value of $p['organizationName'] = 'Forté Initiative'
$groupNameForMatching = trim(strtolower($p['organizationName']));

//there is a group name in this table = 'Forté Initiative'
$matchCountPrep=$conn->prepare("SELECT COUNT(*) AS MatchCount FROM schema.table WHERE LOWER(LTRIM(RTRIM(GroupName))) = :OrganizationName);
$matchCountPrep->bindParam(":OrganizationName", $groupNameForMatching);
try {
    $matchCountPrep->execute();
}
catch (PDOException $matchCountError) {
    echo "ERROR: " . $matchCountError->getMessage();
    exit();
}
$matchCount=$matchCountPrep->fetchColumn();

This code works as expected for all groups which do not contain special characters. For example:

"Chess club " and "Chess Club" match and the value of $matchCount = 1 for this group.

However "Forté Initiative" and "Forté Initiative" somehow do not match, even though they are identical strings. I believe this is probably due to the character encoding used for the character "é".

I do not have administrative nor dba access to the database server or instance. I do have administrator access to the web server and PHP configuration executing the script. However, this script will be migrated to another web server which I do not have administrative access to in the near future.

For the above reasons, I am looking for an in-code solution. Is there a way to force like-encoding on the two strings while making the comparison in the filter clause of the query?

Thank you for any suggestions. This is my first posting on Stack Overflow so please let me know if my question is lacking in any way and I will update. I appreciate your help!

EDIT: I received guidance on another site which led me to the problem with my code and two solutions.

The problem is that strtolower is not a multi-byte safe function and will not preserve native encoding. The accented character's encoding is broken when I use it. Additionally I was operating under a false assumption that the string comparisons would be case sensitive. The entire need for using strtolower does not actually exist.

Solution one: use mb_strtolower instead of strtolower. mb_strtolower successfully preserves the encoding and the MSSQL comparison behaves as expected when this function is used.

Solution two: compare the strings after trimming the white space but without forcing either side of the comparison to lower case. The filter clause of the query will still evaluate the strings as matching even if they do not match strictly by case. NOTE: this is true by default, but may not be true in all environments depending on the collation of your database.

  • Are you sure you pass `Forté Initiative` instead of `Forte Initiative`? – Ilyes Oct 22 '18 at 20:08
  • Yes in the database and API return, the value is "Forté Initiative". The value of $p['organizationName'] at the time of the evaluation is known, because it is later written to the database. We don't want to write it if it already exsits... however that particular group is being written when it already exists. And then our database ends up with multiple copies of "Forté Initiative" because it keeps failing to detect that the names are identical. – Dan Quixote Oct 22 '18 at 21:15
  • What encoding do you use on the DB column ? – Dan Oct 23 '18 at 14:13
  • Hi @Dan, I am not sure how to determine that. The result of `SELECT SERVERPROPERTY('Collation');` is: SQL_Latin1_General_CP1_CI_AS I am not admin on that server but can access the relevant schema via SSMS. Is there a way for me to determine the character encoding? – Dan Quixote Oct 23 '18 at 17:19
  • https://stackoverflow.com/questions/7321159/determining-the-character-set-of-a-table-database if you read this you want the second part of the accepted answer and it should tell you how to get the column encoding, but reading about the DB wide setting you have (https://stackoverflow.com/questions/5039211/what-does-collate-sql-latin1-general-cp1-ci-as-do) its all case insensitive so you shouldn't need to do the LOWER() function a tol – Dan Oct 24 '18 at 06:55

0 Answers0