Expression to find multiple spaces in string

Question

We handle a lot of sensitive data and I would like to mask passenger names using only the first and last letter of each name part and join these by three asterisks (***),

For example: the name 'John Doe' will become 'J***n D***e'

For a name that consists of two parts this is doable by finding the space using the expression:

LEFT(CardHolderNameFromPurchase, 1) + 
 '***' + 
 CASE WHEN CHARINDEX(' ', PassengerName) = 0 
      THEN RIGHT(PassengerName, 1) 
      ELSE SUBSTRING(PassengerName, CHARINDEX(' ', PassengerName) -1, 1) +
           ' ' + 
           SUBSTRING(PassengerName, CHARINDEX(' ', PassengerName) +1, 1) +
           '***' + 
           RIGHT(PassengerName, 1) 
 END

However, the passenger name can have more than two parts, there is no real limit to it. How should can I find the indices of all spaces within an expression? Or should I maybe tackle this problem in a different way?

Any help or pointer is much appreciated!

If you store the entire name in a single column you are fighting an impossible fight. How would you know where first name ends and the last name begins? By far the best approach would be to use two columns instead of one to hold their name. Anything you do with a single column is brittle and will not always work. — Sean Lange, Feb 19 '18 at 16:17
Parsing names is the oldest problem in databases maybe. There's no catch all solution because each is 100% reliant on your data. — Jacob H, Feb 19 '18 at 16:17
What would you expect if someone's name was say `'Edgar Allan Poe'`? Would you want to output to be `'E***r A***n P***e'`? What version of SQL Server? Also, if you're planning to do these to try and mask people's names, there are quite clever applications out there than could likely work out what the name actually is (especially with such a simple algorithm). — Thom A, Feb 19 '18 at 16:20
First guess: Pick a _string splitter_ that returns a table of results, apply a function to normalize whitespace (tabs, ... -> single blank), split on blanks, apply function to [X the paragrab](http://www.poedecoder.com/qrisse/works/xing.php), reassemble pieces, sigh contentedly. (That smells bad.) — HABO, Feb 19 '18 at 16:25
Possible duplicate of [how to split a string by space in SQL](https://stackoverflow.com/questions/40274714/how-to-split-a-string-by-space-in-sql) — Tab Alleman, Feb 19 '18 at 16:26
Here's some answers: SQL Server 2008. And yes we do also have Firstname Lastname columns, but in some cases these are combined and the information is lost. And if there are multiple name parts, e.g. `Edgar Allen Poe` the expected result is `E***r A***n P***e` indeed. — Fraukje, Feb 19 '18 at 16:29

score 5 · Accepted Answer · answered Feb 19 '18 at 16:32

This solution does what you want it to, but is really the wrong approach to use when trying to hide personally identifiable data, as per Gordon's explanation in his answer.

SQL:

declare @t table(n nvarchar(20));
insert into @t values('John Doe')
,('JohnDoe')
,('John Doe Two')
,('John Doe Two Three')
,('John O''Neill');

select n
      ,stuff((select ' ' + left(s.item,1) + '***' + right(s.item,1)
              from dbo.fn_StringSplit4k(t.n,' ',null) as s
              for xml path('')
              ),1,1,''
              ) as mask
from @t as t;

Output:

+--------------------+-------------------------+
|         n          |          mask           |
+--------------------+-------------------------+
| John Doe           | J***n D***e             |
| JohnDoe            | J***e                   |
| John Doe Two       | J***n D***e T***o       |
| John Doe Two Three | J***n D***e T***o T***e |
| John O'Neill       | J***n O***l             |
+--------------------+-------------------------+

String splitting function based on Jeff Moden's Tally Table approach:

create function [dbo].[fn_StringSplit4k]
(
  @str nvarchar(4000) = ' '    -- String to split.
 ,@delimiter as nvarchar(1) = ','  -- Delimiting value to split on.
 ,@num as int = null      -- Which value to return, null returns all.
)
returns table
as
return
     -- Start tally table with 10 rows.
 with n(n)   as (select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1)

     -- Select the same number of rows as characters in @str as incremental row numbers.
     -- Cross joins increase exponentially to a max possible 10,000 rows to cover largest @str length.
  ,t(t)   as (select top (select len(isnull(@str,'')) a) row_number() over (order by (select null)) from n n1,n n2,n n3,n n4)

     -- Return the position of every value that follows the specified delimiter.
  ,s(s)   as (select 1 union all select t+1 from t where substring(isnull(@str,''),t,1) = @delimiter)

     -- Return the start and length of every value, to use in the SUBSTRING function.
     -- ISNULL/NULLIF combo handles the last value where there is no delimiter at the end of the string.
  ,l(s,l) as (select s,isnull(nullif(charindex(@delimiter,isnull(@str,''),s),0)-s,4000) from s)

 select rn
          ,item
 from(select row_number() over(order by s) as rn
    ,substring(@str,s,l) as item
  from l
  ) a
 where rn = @num
  or @num is null;
GO

The 8K splitter of Jeff's that you linked too is far superior to the much older 4k version. — Sean Lange, Feb 19 '18 at 16:36
I don't think the OP is using `nvarchar`. If they were, I'd expect their example of what they'd tried to use an `nvarchar` for their string literals as well. Such as: `CHARINDEX(N' ', PassengerName) = 0` and `+ N'***'`. — Thom A, Feb 19 '18 at 16:43
@SeanLange Actually, it seems you are referencing something that is is not based on at all. The function I posted above is simply the 8k splitter ranged down to handle `nvarchar` values using the same logic as the 4k splitter. In my pre-2016 environments I typically have the one posted here, the 8k and an XML based splitter to handle `max` data types to be used as best suited. — iamdave, Feb 19 '18 at 16:46
Right you are. I don't how much the 4K version has kept up with the various performance improvements made over the years. This is a good solution either way and by no means do I want to stray off topic. — Sean Lange, Feb 19 '18 at 16:49

score 2 · Answer 2 · answered Feb 19 '18 at 16:29

If you consider PassengerName as sensitive information, then you should not be storing it in clear text in generally accessible tables. Period.

There are several different options.

One is to have reference tables for sensitive information. Any table that references this would have an id rather than the name. Viola. No sensitive information is available without access to the reference table, and that would be severely restricted.

A second method is a reversible compression algorithm. This would allow the the value to be gibberish, but with the right knowledge, it could be transformed back into a meaningful value. Typical methods for this are the public key encryption algorithms devised by Rivest, Shamir, and Adelman (RSA encoding).

If you want to do first and last letters of names, I would be really careful about Asian names. Many of them consist of two or three letters, when written in Latin script. That isn't much hiding. SQL Server does not have simple mechanisms to do this. You can write a user-defined function with a loop to manager the process. However, I view this as the least secure and least desirable approach.

Thanks for your considerations. However, this is merely a technical question and not a discussion on how sensitive information should be stored from my perspective. So if I understand you well, I should go for a user-defined function? — Fraukje, Feb 19 '18 at 16:31

score 2 · Answer 3 · answered Feb 19 '18 at 16:39

This uses Jeff Moden's DelimitedSplit8K, as well as the new functionality in SQL Server 2017 STRING_AGG. As I don't know what version you're using, I've just gone "whole hog" and assumed you're using the latest version.

Jeff's function is invaluable here, as it returns the ordinal position, something which Microsoft have foolishly omitted from their own function, STRING_SPLIT (and didn't add in 2017 either). Ordinal position is key here, so we can't make use of the built in function.

WITH VTE AS(
    SELECT *
    FROM (VALUES ('John Doe'),('Jane Bloggs'),('Edgar Allan Poe'),('Mr George W. Bush'),('Homer J Simpson')) V(FullName)),
Masking AS (
    SELECT *,
           ISNULL(STUFF(Item, 2, LEN(item) -2,'***'), Item) AS MaskedPart
    FROM VTE V
         CROSS APPLY dbo.delimitedSplit8K(V.Fullname, ' '))
SELECT STRING_AGG(MaskedPart,' ') AS MaskedFullName
FROM Masking
GROUP BY Fullname;

Edit: Nevermind, OP has commented they are using 2008, so STRING_AGG is out of the question. @iamdave, however, has posted an answer which is very similar to my own, just do it the "old fashioned XML way".

score 2 · Answer 4 · answered Feb 19 '18 at 16:44

Depending on your version of SQL Server, you may be able to use the built-in string split to rows on spaces in the name, do your string formatting, and then roll back up to name level using an XML path.

create table dataset (id int identity(1,1), name varchar(50));
insert into dataset (name) values
('John Smith'),
('Edgar Allen Poe'),
('One Two Three Four');

with split as (
select id, cs.Value as Name
from dataset
cross apply STRING_SPLIT (name, ' ') cs
),
formatted as (
select
  id,
  name,
  left(name, 1) + '***' + right(name, 1) as out
from split
)
SELECT 
   id, 
   (SELECT ' ' + out 
    FROM formatted b
    WHERE a.id = b.id
    FOR XML PATH('')) [out_name]
FROM formatted a
GROUP BY id

Result:

id   out_name
1    J***n S***h
2    E***r A***n P***e
3    O***e T***o T***e F***r

score 2 · Answer 5 · answered Feb 19 '18 at 17:01

You can do that using this function.

create function [dbo].[fnMaskName] (@var_name varchar(100))
RETURNS varchar(100)
WITH EXECUTE AS CALLER
AS
BEGIN
    declare @var_part varchar(100)
    declare @var_return varchar(100)
    declare @n_position smallint

    set @var_return = ''
    set @n_position = 1

    WHILE @n_position<>0
    BEGIN
        SET @n_position = CHARINDEX(' ', @var_name)
        IF @n_position = 0
            SET @n_position = LEN(@var_name)

        SET @var_part = SUBSTRING(@var_name, 1, @n_position)
        SET @var_name = SUBSTRING(@var_name, @n_position+1, LEN(@var_name))
        if @var_part<>''
            SET @var_return = @var_return + stuff(@var_part, 2, len(@var_part)-2, replicate('*',len(@var_part)-2)) + ' ' 
    END
    RETURN(@var_return)

END

A `WHILE` is going to be a very slow performing answer, especially when it's also inside a Scalar function. A dataset approach will be far more efficient. — Thom A, Feb 19 '18 at 17:05

Expression to find multiple spaces in string

5 Answers5