4

Some weird characters are getting stored in one of the table. They seem to be coming from .csv feeds so I don't have much control over that.

Hello Kitty Essential Accessory Kit

How can I clean it and remove these characters. I am ok doing it at db level or in C#.

EDIT

As per the suggestions received in comments. I am also looking into what I can do to correct it at feed level. Here's more info on it.

  1. Feeds are from third party.
  2. I opened feed in notepad++ and checked the encoding menu I see dot in front of 'encode in ansi' so I believe that's the encoding of the file
  3. And that's how it appears in notepad++ "Hello Kitty Essential Accessory Kit"
  4. One strange thing though. when I search that row in powershel from csv file. and it comes up with the row. I don't see these weird characters there..
gotqn
  • 42,737
  • 46
  • 157
  • 243
Ankit
  • 1,867
  • 2
  • 21
  • 40
  • 4
    A wild guess is that the CSV is using custom double quotes (“” instead of "") and that you are not using the same encoding/character set in your database connection, causing them to appear as à and Â. Perhaps you could add an encoding parameter to your database connection string to prevent this from happening. – C.Evenhuis Jul 29 '15 at 06:41
  • To be clear, you're wanting to remove these characters completely, rather than just removing the diacritical marks from them? – Damien_The_Unbeliever Jul 29 '15 at 06:42
  • @Damien_The_Unbeliever yes I am ok with removing these characters completely so it will left me with 'Hello Kitty Essential Accessory Kit' – Ankit Jul 29 '15 at 06:43
  • What are the weird characters? Maybe you can define it's ascii range from 195 to x or something like that – CeOnSql Jul 29 '15 at 06:52
  • 2
    Agree with @C.Evenhuis: is some weird encoding problem. I think is a typical case in which solving the problem not at this root will cause troubles later – Felice Pollano Jul 29 '15 at 06:56
  • @FelicePollano I also agree on fixing the root. But it's actualy seems to be within the feed. Here's what I see in feed as well when I open it in notepad++ 'Hello Kitty Essential Accessory Kit' . So I don't really can do much about it.. – Ankit Jul 29 '15 at 07:02
  • That looks like you have UTF-8 encoding issue, and you probably should look into the file in which encoding it actually is and what encoding you're using when loading it into the database – James Z Jul 29 '15 at 07:09
  • @Ankit So fix it in the feed. Data like this shouldn't get into the database in the first place. – Luaan Jul 29 '15 at 07:24

4 Answers4

9

You can use .net regular expression functions. For example, using Regex.Replace:

Regex.Replace(s, @"[^\u0000-\u007F]", string.Empty);

As there is no support for regular expressions in SQL Server you need to create a SQL CLR function. More information about the .net integration in SQL Server can be found here:


In your case:

  1. Open Visual Studio and create Class Library Project:

    enter image description here

  2. Then rename the class to StackOverflow and paste the following code in its file:

    using Microsoft.SqlServer.Server;
    using System;
    using System.Collections.Generic;
    using System.Data.SqlTypes;
    using System.Linq;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;
    
    public class StackOverflow
    {
        [SqlFunction(DataAccess = DataAccessKind.None, IsDeterministic = true, Name = "RegexReplace")]
        public static SqlString Replace(SqlString sqlInput, SqlString sqlPattern, SqlString sqlReplacement)
        {
            string input = (sqlInput.IsNull) ? string.Empty : sqlInput.Value;
            string pattern = (sqlPattern.IsNull) ? string.Empty : sqlPattern.Value;
            string replacement = (sqlReplacement.IsNull) ? string.Empty : sqlReplacement.Value;
            return new SqlString(Regex.Replace(input, pattern, replacement));
        }
    }
    
  3. Now, build the project. Open the SQL Server Management Studio. Select your database and replace the path value of the following FROM clause to match your StackOverflow.dll:

    CREATE ASSEMBLY [StackOverflow] FROM 'C:\Users\gotqn\Desktop\StackOverflow\StackOverflow\bin\Debug\StackOverflow.dll';
    
  4. Finally, create the SQL CLR function:

    CREATE FUNCTION [dbo].[StackOverflowRegexReplace] (@input NVARCHAR(MAX),@pattern NVARCHAR(MAX), @replacement NVARCHAR(MAX))
    RETURNS NVARCHAR(4000)
    AS EXTERNAL NAME [StackOverflow].[StackOverflow].[Replace]
    GO
    

You are ready to use RegexReplace .net function directly in your T-SQL statements:

    SELECT [dbo].[StackOverflowRegexReplace] ('Hello Kitty Essential Accessory Kit', '[^\u0000-\u007F]', '')

    //Hello Kitty Essential Accessory Kit
gotqn
  • 42,737
  • 46
  • 157
  • 243
  • 1
    Accepting this as an answer as perhaps has given me flexibility to implement it at Sql Server or c# level. – Ankit Jul 30 '15 at 06:16
2

if you are looking for alphabets and numbers only in a string, than this can help you out.

In this, Regex is used to replace all characters other than alphabets and numbers.

Community
  • 1
  • 1
Tirthak Shah
  • 515
  • 2
  • 11
  • not really alphabets & numbers.. I just need to remove these weird looking characters, keeping everything else.. – Ankit Jul 29 '15 at 06:40
  • I've already provided links, which are now being posted as answer and not sure why I am being down voted. – Tirthak Shah Jul 29 '15 at 07:22
  • Link-only answers are frowned upon. In this case, you should simply cast a close vote - duplicate. Or at most, post this as a comment. – Luaan Jul 29 '15 at 07:23
  • @Luaan I was not having rights to do so, otherwise I'd definitely have down voted – Tirthak Shah Jul 29 '15 at 10:16
  • @TirthakShah Well, just avoid such questions until you do. There's a plenty of questions (and answerers) around here, there's nothing wrong with skipping over those that you can't answer / downvote / close properly. – Luaan Jul 29 '15 at 10:30
2

This seems to work:

string input = "Hello Kitty Essential Accessory Kit";
string res = Regex.Replace(input, @"[^a-zA-Z0-9\s]", "");

Console.WriteLine(res); // Hello Kitty Essential Accessory Kit
w.b
  • 11,026
  • 5
  • 30
  • 49
  • 2
    a little too eager IMHO, you remove .,; that can be in the field as well. By a general point of view it could even be possible that sometimes an  would be actual part of the field... the problem has to be solved CSV side IMHO – Felice Pollano Jul 29 '15 at 07:08
  • @w.b can you modify your answer to include .;: please. That should serve me as an answer then.. – Ankit Jul 30 '15 at 05:53
  • infact to include all characters we se on a keyboard. – Ankit Jul 30 '15 at 06:05
1

Try this:

DECLARE @str VARCHAR(400)
    DECLARE @expres  VARCHAR(50) = '%[~,@,#,$,%,&,*,(,),.,!,Ã,Â]%'
      SET @str = 'Hello Kitty Essential Accessory Kit'
      WHILE PATINDEX( @expres, @str ) > 0
          SET @str = Replace(REPLACE( @str, SUBSTRING( @str, PATINDEX( @expres, @str ), 1 ),''),'-',' ')

      SELECT @str

Just add any special char you want to get rid off in @expres variable.

Hope this helps!

Abhay Chauhan
  • 404
  • 3
  • 11