2

In my project, sometimes it happens that user enters non-english characters in name column. I want to search multiple tables in my database for non-english characters and delete them or replace with '' (empty string).

I've already gone through following links but still answer is not clear:

  1. Replacing multiple characters in SQL
  2. Replacing characters in access SQL
  3. MSDN: Pattern matching using Regular Expression

Looking for a SQL script that will do this task. Currently I'm using the replace method, something like :

update TBLDAILYROUTINE a   set a.NAME = replace(a.NAME,'é','')
Community
  • 1
  • 1
Rohit
  • 6,365
  • 14
  • 59
  • 90
  • 3
    Shouldn't the app should validate the data before it gets into the database? – StuartLC Feb 09 '12 at 09:32
  • 1
    Exactly, put validation while the user enters all those values. Here you can use any method. But the most reliable is Regex. Database on the other hand is just a place of storage. Don't make it to make any processing except CRUD. – Oybek Feb 09 '12 at 09:37
  • I agree with client validation, but maybe this is not an option and it could be bypassed by "smart" users. Server validation/replacing just can't fail. – Thomas Feb 09 '12 at 09:42
  • @Thomas Then the developer must be "smarter" than "smart" users. All in all developers has much more previlidges and flexibility to constraint the user )) – Oybek Feb 09 '12 at 09:51
  • Correct! But sometimes you just can't because dumb users can always find a way to break things :-) Just an example: if the input is a web form the user can always view html source and reproduce input fields bypassing client validation... – Thomas Feb 09 '12 at 09:59
  • Well, our application is smart enough to process these characters and even our application have internationalization support but for some reasons processor is having issues in accepting such characters (which is even fixed now) but I need to support previous customer issues so thats why I need a script which will remove all such non-english characters. I think that clears your thoughts about the requirement. – Rohit Feb 09 '12 at 10:16

3 Answers3

2

As others have commented it's best to make the validation and replacements before the data enters the database.

In order to clean up existing data the easiest way would probably be to use a SQL CLR UDF that can be run on the columns needed to be cleaned up.

A quick and dirty C# snippet to clean off any non ASCII characters (with the side effect that '?' is also removed):

public static string StripNonAsciiContent(string source)
{
    Encoding utf8 = new UTF8Encoding();
    Encoding ascii = Encoding.ASCII;

    byte[] input = utf8.GetBytes(source); 
    byte[] output = Encoding.Convert(utf8, ascii, input);

    return ascii.GetString(output).Replace("?", "");
}

A better approach would be to write a Regular expression that replaces all characters outside of the allowed range:

public static string StripNotAllowedCharacters(string source)
{
    return Regex.Replace(source, @"[^a-zA-Z-]", "");
}
PHeiberg
  • 29,411
  • 6
  • 59
  • 81
0

How abount using a function? Create the function:

CREATE FUNCTION dbo.sf_ReplaceChars (@MyString NVARCHAR(MAX))
RETURNS NVARCHAR(50)
AS
BEGIN
    DECLARE @tmpString NVARCHAR(100)
    SET @tmpString = @MyString 
    SET @tmpString = LOWER(@tmpString)
    SET @tmpString =  REPLACE(@tmpString,'à', 'a')
    SET @tmpString =  REPLACE(@tmpString,'è', 'e')
    SET @tmpString =  REPLACE(@tmpString,'é', 'e')
    SET @tmpString =  REPLACE(@tmpString,'ì', 'i')
    SET @tmpString =  REPLACE(@tmpString,'ò', 'o')
    SET @tmpString =  REPLACE(@tmpString,'ù', 'u')
    SET @tmpString =  REPLACE(@tmpString,'ç', 'c')
    RETURN @tmpString
END
GO

then use it whenever needed, assuming the user input is stored in variable @userInput:

INSERT INTO myTable (text) VALUES select dbo.sf_ReplaceChars(@userInput)

You can add more REPLACE to the function to cover all your needs.

Regards Thomas

Thomas
  • 154
  • 6
  • Error prone. What if there will be ó, ñ or even қғҳ? – Oybek Feb 09 '12 at 09:38
  • You'll end up with infinite numbers of replaces and will never reach the end. This in turn will slow down your server when there will be too much replacements. – Oybek Feb 09 '12 at 09:40
  • Of course you are right, but this is to fix the user input: how many input there will be? By the way he could also set up a stored proc to nightly correct every column of every table... – Thomas Feb 09 '12 at 09:50
  • thats what I'm trying to do but need any optimized (and somewhat easy too :) ) way to do that. – Rohit Feb 09 '12 at 11:13
0

Run the bellow query:

declare @table varchar(50)
declare @column varchar(50)
declare @updateCommand varchar(50)

DECLARE AllMyColumns CURSOR FOR
    SELECT top 1 OBJECT_NAME(c.OBJECT_ID) TableName, c.name ColumnName
    FROM sys.columns AS c
    JOIN sys.types AS t ON c.user_type_id=t.user_type_id
    WHERE t.name = 'varchar' --select your datatype
    ORDER BY c.OBJECT_ID
OPEN AllMyColumns
FETCH NEXT FROM AllMyColumns into @table, @column;

WHILE @@FETCH_STATUS = 0 BEGIN
     set @updateCommand = 'update '+ @table + ' set '+ @column + ' = replace('+ @column +',''é'','''')'

    exec(@updateCommand)
    FETCH NEXT FROM AllMyColumns into @table, @column;
end

close AllMyColumns
DEALLOCATE AllMyColumns

I edited with the update command you suggested

Diego
  • 34,802
  • 21
  • 91
  • 134