Comparison and replacing of Arabic Letters is Incorrect for Specific Letter

Question

I have a database build using SQL Server 2019, inside there is a function called handle_letters performs some processing using T-SQL REPLACE function as follows:

create or alter function handle_letters(@txt as nvarchar(50))
    returns nvarchar(50)
begin
    set @txt = replace(@txt, N'أ', N'ا');
    set @txt = replace(@txt, N'إ', N'ا');
    set @txt = replace(@txt, N'ى', N'ا');
    set @txt = replace(@txt, N'آ', N'ا');
    set @txt = replace(@txt, N'ؤ', N'ء');
    set @txt = replace(@txt, N'ـ', N'');
    set @txt = replace(@txt, N'ة', N'ه')
    set @txt = replace(@txt, N'ُ', N'');
    set @txt = replace(@txt, N'َ', N'');
    set @txt = replace(@txt, N'ً', N'');
    set @txt = replace(@txt, N'ٌ', N'');
    set @txt = replace(@txt, N'ِ', N'');
    set @txt = replace(@txt, N'ٍ', N'');
    set @txt = replace(@txt, N'ّ', N'');
    set @txt = replace(@txt, N'ئ', N'ء');
    return replace(@txt, N'عبدا', N'عبد ا');
end

The problem appears when working the letter Taa Marbouta "ة" as I need to return it to its linguistic origin as Haa "ه" (similar pronunciation to the English letter H) however, SQL server deals with the Taa Marbouta as same as Taa Mabsouta "ت" (similar pronunciation to the English letter T) even though they are 2 different letters in the Arabic language.

The following is a correct execution of the function:

select dbo.handle_letters(N'كرة') = "كره"

While the following is incorrect result:

select dbo.handle_letters(N'كرات')="كراه" which is incorrect as it should remain as it is "كرات"

The database collation is SQL_Latin1_General_CP1_CI_AS which I believe is the problem but I could be mistaken

Also the SQL Server runs on a Linux machine, meaning I cannot use DotNet to create a custom function in C# to handle this process. as indicated by this Github page

Did you search for anything? The duplicate was the first result when I searched. — Dale K, Jul 05 '23 at 06:50
The question doesn't seem to have anything to do with SQL Server's handling of Arabic but performing a custom conversion. `I cannot use DotNet` ??????? .NET Core runs on Linux since v1.0. That's the whole point it was created. All *Microsoft* tutorials cover Linux and some features, like dockerization, only really target Linux. You can publish a .NET application [straight into a Linux docker image](https://devblogs.microsoft.com/dotnet/announcing-builtin-container-support-for-the-dotnet-sdk/) for example — Panagiotis Kanavos, Jul 05 '23 at 06:56
The collation is definitely an issue. Unicode ensures the letters aren't converted but *collation* specifies the ordering and comparison rules. You can specify the collation of columns and comparisons but not variables. If you stored the characters in a lookup table you'd be able to specify the collation you want — Panagiotis Kanavos, Jul 05 '23 at 07:16
If you *must* run the script inside SQL Server (why??), choose Python. The C# extension was never a serious effort, more of a proof-of-concept. It's barely documented, hasn't been updated in 2 years except to bump .NET to 6 (not even 7) and depends on an ancient version of `Microsoft.Data.Analysis` (0.4.0 instead of 0.20.1) — Panagiotis Kanavos, Jul 05 '23 at 07:26
@DaleK, This is not about storing it is about manipulating that is why I didn't check those answers — Ossama Nasser, Jul 05 '23 at 07:37
@PanagiotisKanavos I have .Net core application running on Live Linux server, however the SQL Server C# extension is not supported on Linux, I even included a link to the GitHub page of the extension to confirm. As for Why I need to run the script inside SQL Server, I can run it from an external application, however from within SQL Server it is faster in committing changes also, it is a one time use as a preprocessing step is included in the software which uses the database preventing the need for such function. — Ossama Nasser, Jul 05 '23 at 07:41
@OssamaNasser that extension isn't supported, period. It's a proof-of-concept with a *single* committer. I suspect it was an intern project, never adopted by the SQL team itself. Otherwise, why didn't they target *F#*, the .NET language actually used for data science? — Panagiotis Kanavos, Jul 05 '23 at 07:47
As for `it is faster in committing changes also` not really. *Committing* is the same. The operation itself is slower and has to be repeated for every data row though. You'd need to use [NATIVE_COMPILATION](https://learn.microsoft.com/en-us/sql/relational-databases/in-memory-oltp/native-compilation-of-tables-and-stored-procedures?view=sql-server-ver16#native-compilation-of-stored-procedures) to get it near what C# would do. `a one time use as a preprocessing step` so an ETL job. That's best done in an external application. SSIS, Pandas, your own code. — Panagiotis Kanavos, Jul 05 '23 at 07:49
@OssamaNasser collation is not about storing data either, its about ordering and comparing data, which as best I can tell is what your question is about. — Dale K, Jul 05 '23 at 07:50
@OssamaNasser you should probably check Python scripts run through `sp_execute_external_script`. The input of a query will appear in the script as a Pandas Dataframe. Pandas supports Unicode normalization (along with most string, regex functions and more) through [pandas.Series.str.normalize](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.normalize.html). If you do need to map the values, [pandas.Series.map](https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html) will do it. You may be able to replace the function with 3-4 lines of Python — Panagiotis Kanavos, Jul 05 '23 at 07:55
It seems to be this line that is the problem. So are you sure the function is actually wrong? set @txt = replace(@txt collate Arabic_CI_AS, N'ة' collate Arabic_CI_AS, N'ه'collate Arabic_CI_AS ) -- i don't know arabic — siggemannen, Jul 05 '23 at 08:05

score 0 · Answer 1 · answered Jul 05 '23 at 14:20

Upon reviewing the function, I noticed that the line SET @txt = REPLACE(@txt, N'ة', N'ه'); is unnecessary to achieve what you want.

I preserved of "ات" at the end of Arabic by adding :

IF RIGHT(@txt, 2) = N'ات' AND LEN(@txt) > 2
        SET @txt = LEFT(@txt, LEN(@txt) - 2) + N'ات';
    ELSE
        SET @txt = REPLACE(@txt, N'ة', N'ه');

I updated the function as the following :

CREATE OR ALTER FUNCTION handle_letters(@txt AS nvarchar(50))
RETURNS nvarchar(50)
BEGIN
    SET @txt = REPLACE(@txt, N'أ', N'ا');
    SET @txt = REPLACE(@txt, N'إ', N'ا');
    SET @txt = REPLACE(@txt, N'ى', N'ا');
    SET @txt = REPLACE(@txt, N'آ', N'ا');
    SET @txt = REPLACE(@txt, N'ؤ', N'ء');
    SET @txt = REPLACE(@txt, N'ـ', N'');
   -- SET @txt = REPLACE(@txt, N'ة', N'ه');
    SET @txt = REPLACE(@txt, N'ُ', N'');
    SET @txt = REPLACE(@txt, N'َ', N'');
    SET @txt = REPLACE(@txt, N'ً', N'');
    SET @txt = REPLACE(@txt, N'ٌ', N'');
    SET @txt = REPLACE(@txt, N'ِ', N'');
    SET @txt = REPLACE(@txt, N'ٍ', N'');
    SET @txt = REPLACE(@txt, N'ّ', N'');
    SET @txt = REPLACE(@txt, N'ئ', N'ء');

    -- Preserve "ات" at the end of Arabic words
    IF RIGHT(@txt, 2) = N'ات' AND LEN(@txt) > 2
        SET @txt = LEFT(@txt, LEN(@txt) - 2) + N'ات';
    ELSE
        SET @txt = REPLACE(@txt, N'ة', N'ه');

    SET @txt = REPLACE(@txt, N'ُ', N'');
    SET @txt = REPLACE(@txt, N'َ', N'');
    SET @txt = REPLACE(@txt, N'ً', N'');
    SET @txt = REPLACE(@txt, N'ٌ', N'');
    SET @txt = REPLACE(@txt, N'ِ', N'');
    SET @txt = REPLACE(@txt, N'ٍ', N'');
    SET @txt = REPLACE(@txt, N'ّ', N'');
    SET @txt = REPLACE(@txt, N'ئ', N'ء');

    RETURN REPLACE(@txt, N'عبدا', N'عبد ا');
END

-- Create a test table
CREATE TABLE TestTable
(
    ID INT IDENTITY(1, 1),
    TextValue NVARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS
);

-- Insert some test data
INSERT INTO TestTable (TextValue) VALUES (N'كرة');
INSERT INTO TestTable (TextValue) VALUES (N'كرات');

-- Select the original values
SELECT TextValue FROM TestTable;
-- Output: كرة, كرات

-- Apply the handle_letters function
SELECT dbo.handle_letters(TextValue) AS HandledText FROM TestTable;
-- Output: كره, كرات

-- Change the collation of the TextValue column to Arabic_CI_AI
--ALTER TABLE TestTable
--ALTER COLUMN TextValue NVARCHAR(50) COLLATE Arabic_CI_AI;

-- Apply the handle_letters function again
SELECT dbo.handle_letters(TextValue) AS HandledText FROM TestTable;
-- Output: كره, كرات

TextValue
كرة
كرات

HandledText
كره
كرات

HandledText
كره
كرات

fiddle

While this would solve part of the problem, the truth is more general than the simple example given, the letter ت is not preceded by the letter ا always, it can be preceded by other letters for example بنت — Ossama Nasser, Jul 17 '23 at 10:17

Comparison and replacing of Arabic Letters is Incorrect for Specific Letter

1 Answers1