0

I'm looking to clean name columns by removing any numeric or special characters from the column values and leaving only alphabetic values (A-Z).

I've tried this so far to make the column all uppercase and trim leading/trailing whitespaces:

upper(trim([Name])) [Name]

I'm stuck on how to proceed from here, any help is appreciated!

TeflonMusk
  • 130
  • 8
  • 2
    This isn't particularly easy on SQL Server. What version are you using? Also, assuming that names only include the characters A-Z is a poor assumption. They don't. – Thom A Dec 02 '21 at 16:56
  • This is really something that should be done pre saving to the DB. In a UI layer for example. You can use Regular Expressions (depending on SQL version can build your own with CLR functions). – Brad Dec 02 '21 at 17:00
  • regular expression replace is very powerful in Oracle, but have big limitations in sql server. try to look in the link below, its been answered here https://stackoverflow.com/questions/1007697/how-to-strip-all-non-alphabetic-characters-from-string-in-sql-server – NNM Dec 02 '21 at 17:04
  • 1
    check this answer https://stackoverflow.com/a/26506268/9925593 – Pato Dec 02 '21 at 17:04
  • What is the datatype of `Name`? if `varchar` it might be viable to construct a `TRANSLATE` with all the unwanted characters replaced with a known one and then `REPLACE` to get rid of it. If `nvarchar` there would be too many "bad" characters though – Martin Smith Dec 02 '21 at 17:08
  • @MartinSmith It is varchar, can you explain more about this process/include psuedocode – TeflonMusk Dec 02 '21 at 17:10
  • `varchar` supports 256 characters. So construct a [`TRANSLATE`](https://learn.microsoft.com/en-us/sql/t-sql/functions/translate-transact-sql?view=sql-server-ver15) with the `characters` string being the 204 non English alphabet characters in your collation code page and a `translations` string being `SPACE(204)` - then replace spaces with an empty string – Martin Smith Dec 02 '21 at 17:12
  • @lptr - that's a nice solution! – Martin Smith Dec 02 '21 at 17:17
  • 1
    If these are names of people though you should not be doing this as Larnu comments. – Martin Smith Dec 02 '21 at 17:18
  • @MartinSmith .. the code will fail if there is a space before a letter and no more non-letter characters after the letter ..eg 'abc@#£&*() z' – lptr Dec 02 '21 at 17:27
  • Thanks all, I ended up using @Pato 's linked solution and it worked like a charm. The only modification I made was add 'space' to the 'KeepValues' variable to account for middle names. – TeflonMusk Dec 02 '21 at 17:34
  • Oh dear, my name is `Anne-Marie O'Brien-Macdonald`, so that becomes `AnneMarie OBrienMacdonald`, which is just ridiculous, see https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/ – Charlieface Dec 02 '21 at 18:43
  • @Charlieface this fixes this partially Set KeepValues = '%[^a-z^ ^-]%' but, the ' is an interesting case for sure. On one part, I don't want random, error entries of ' in the name, but I don't want to exclude legitimate cases of the ' use. – TeflonMusk Dec 02 '21 at 18:48
  • Fix the problem at source: if someone is constantly entering incorrect names then fire them. It's simply impossible to clean up this kind of data afterwards – Charlieface Dec 02 '21 at 18:49
  • @Charlieface in this job market?! – TeflonMusk Dec 02 '21 at 19:20
  • In this economic climate if your company would rather pay you so much more per hour to fix the data entry operator's issues I'd be considering looking elsewhere for work because they're not spending efficiently. – AlwaysLearning Dec 03 '21 at 00:06
  • @lptr - if you promote your comment to an answer I'll remove my one – Martin Smith Dec 03 '21 at 10:12
  • 1
    @MartinSmith…your answer is fine and it is better than my comment. – lptr Dec 03 '21 at 11:31

1 Answers1

1

Firstly if this is for peoples' names then this is a terrible idea.

Globally names can certainly contain spaces, punctuation characters, accented characters and non Latin characters (and even numbers).

Nonetheless there may be other legitimate use cases for wanting to "remove all non-alphabetic characters from a column"

@lptr posted an interesting idea in the comments that heavily inspires this answer.

First use TRANSLATE to get a list of characters that need to be cleaned. Then call TRANSLATE again using this list and replacing them all with a space. Finally remove all spaces and convert to uppercase as desired in your case.

DECLARE @t TABLE
  (
     colX VARCHAR(100)
  );

INSERT INTO @t
            (colX)
VALUES     ('@#£ab£cd&123x/=+xz'),
            ('%-+=/;:,.abc@#£&*()'),
            ('abc@#£&*() z')

SELECT *,
       Cleaned = UPPER(REPLACE(translate(colx, bad_chars, SPACE(len(bad_chars))), ' ', ''))
FROM   @t
       CROSS APPLY (VALUES(replace(translate(colx, 'abcdefghijklmnopqrstuvwxyz' COLLATE Latin1_General_100_CI_AS, replicate('a', 26)), 'a', '') + '~')) V(bad_chars) 

Returns

+---------------------+-------------------+---------+
|        colX         |     bad_chars     | Cleaned |
+---------------------+-------------------+---------+
| @#£ab£cd&123x/=+xz  | @#££&123/=+~      | ABCDXXZ |
| %-+=/;:,.abc@#£&*() | %-+=/;:,.@#£&*()~ | ABC     |
| abc@#£&*() z        | @#£&*() ~         | ABCZ    |
+---------------------+-------------------+---------+
Martin Smith
  • 438,706
  • 87
  • 741
  • 845
  • I'm trying to follow your SQL snippet, but I don't understand where you're pulling 'bad_chars' from, it looks undefined. Also creating the new table at the beginning is just for holding the sample data, right? – GreySage Mar 31 '22 at 16:15