2

I am trying to find 4 or more consecutive repetitive characters in a column using SQL server. Any help on this would be much appreciated.

My data:

CompanyName
HSBC Inc
Barcccclays
AAAAA
Testtttt
Tesco Plc

My output should be as follows:

CompanyName
Barcccclays
AAAAA
Testtttt

Thanks in advance!

Lukasz Szozda
  • 162,964
  • 23
  • 234
  • 275
user7617078
  • 65
  • 4
  • 10

3 Answers3

4

Yes it is possible, but I would not do it in SQL Server without Regex function:

CREATE TABLE tabC(CompanyName VARCHAR(100));

INSERT INTO tabC(CompanyName) 
    SELECT 'HSBC Inc' UNION ALL
    SELECT 'Barcccclays' UNION ALL
    SELECT 'AAAAA' UNION ALL
    SELECT 'Testtttt' UNION ALL
    SELECT 'Tesco Plc';

WITH mul AS
(
   SELECT REPLICATE(CHAR(32 + N), 4) AS val
   FROM (select top 95 row_number() over(order by t1.number) as N
         from   master..spt_values t1)  AS s
)
SELECT *
FROM tabC c
WHERE LEN(CompanyName) > 4
  AND EXISTS (SELECT 1
              FROM mul
              WHERE CHARINDEX(mul.val,c.CompanyName) > 0)

RextesterDemo

How it works:

1) Generate replicated values like 'aaaa', 'bbbb', 'cccc' and so on

2) Check if your string contains it.

Warning!

This solution checks only ASCII characters from 32 to 126.

EDIT:

How can incorporate above code like this: select @flag = 1 from tabc where 1 = (WITH mul AS ( SELECT REPLICATE(CHAR(32 + N), 4) AS val FROM (select top 95 row_number() over(order by t1.number) as N from master..spt_values t1) AS s ) SELECT * FROM tabC c WHERE LEN(CompanyName) > 4 AND EXISTS (SELECT 1 FROM mul WHERE CHARINDEX(mul.val,c.CompanyName) > 0)).

I'm getting this error:If this statement is a common table expression, or a change tracking context clause, the previous statement must be terminated with a semicolon. pls help

If you need to use it in context where you cannot use CTE change it to subquery.

SELECT *
FROM tabC c
WHERE LEN(CompanyName) > 4
  AND EXISTS (SELECT 1
              FROM (SELECT REPLICATE(CHAR(32 + N), 4) AS val
                    FROM (select top 95 row_number() over(order by t1.number) as N
                          from master..spt_values t1)  AS s) mul
              WHERE CHARINDEX(mul.val,c.CompanyName) > 0)
Lukasz Szozda
  • 162,964
  • 23
  • 234
  • 275
  • Hi, How can incorporate above code like this: select @flag = 1 from tabc where 1 = (WITH mul AS ( SELECT REPLICATE(CHAR(32 + N), 4) AS val FROM (select top 95 row_number() over(order by t1.number) as N from master..spt_values t1) AS s ) SELECT * FROM tabC c WHERE LEN(CompanyName) > 4 AND EXISTS (SELECT 1 FROM mul WHERE CHARINDEX(mul.val,c.CompanyName) > 0)). I'm getting this error:If this statement is a common table expression, or a change tracking context clause, the previous statement must be terminated with a semicolon. pls help. – user7617078 Aug 04 '17 at 12:57
  • 1
    Thanks so much! Much appreciated. – user7617078 Aug 21 '17 at 14:56
  • When I change EXISTS to NOT-EXISTS, it's giving me error: 'Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.'. How could I correct this? TIA! – user7617078 Aug 21 '17 at 16:23
  • @user7617078 I suggest asking new question with your full query and sample data. – Lukasz Szozda Aug 21 '17 at 16:24
  • Sorry, Here is the code : SELECT * FROM tabC c WHERE LEN(CompanyName) > 4 AND NOT EXISTS (SELECT 1 FROM (SELECT REPLICATE(CHAR(32 + N), 4) AS val FROM (select top 95 row_number() over(order by t1.number) as N from master..spt_values t1) AS s) mul WHERE CHARINDEX(mul.val,c.CompanyName) > 0) – user7617078 Aug 21 '17 at 16:25
  • @user7617078 It is working http://rextester.com/OZBKIK11318 Both EXISTS and NOT EXISTS. – Lukasz Szozda Aug 21 '17 at 16:27
  • I have a stored proc which executes above query like this: – user7617078 Aug 21 '17 at 16:29
  • @user7617078 Please ask new question. This is not the way that SO works. Your original question has been solved. – Lukasz Szozda Aug 21 '17 at 16:31
  • Thanks! Posted a new question. – user7617078 Aug 22 '17 at 09:51
2

I would say the simplest solution would be to enable SQL CLR and implement your condition as user-defined function in .NET. There you have the power of Regular expressions to use (ex: the Regex class).

The regex you would need would be along the lines of (.)\1{3}, which matches any character followed by the same charcter at least 3 times.

For example the UDF could be written in C# like this:

using Microsoft.SqlServer.Server;
using System;
using System.Data.SqlTypes;
using System.Text.RegularExpressions;

public partial class UserDefinedFunctions
{
    [SqlFunction(DataAccess = DataAccessKind.None, IsDeterministic = true, IsPrecise = true)]
    public static SqlBoolean ContainsRepeatingCharacters(SqlString input)
    {
        const String pattern = @"(.)\1{3}";     
        Boolean hasMatch = Regex.IsMatch(input.ToString(), pattern);
        return new SqlBoolean(hasMatch);
    }
}

You could then use it like any other UDF:

select companyName
from companies
where dbo.ContainsRepeatingCharacters(companyName) = 1

Implementing the same condition in pure TSQL is by all means possible just it would probably be a mess to read and maintain. For example, consider the brute force solution in this post and think what happens when the desired condition should be changed in future even just a bit more complex.

Imre Pühvel
  • 4,468
  • 1
  • 34
  • 49
  • `Implementing same condition in pure TSQL is by all means possible just it would probably be a mess to read` as always it depends :) – Lukasz Szozda Aug 02 '17 at 11:44
  • A CLR is complete overkill for something like this. A good Ngrams function or Numbers table along with a basic understanding of REPLICATE and CHARINDEX is really all that's required. – Alan Burstein Aug 02 '17 at 17:05
  • @AlanBurstein, if requirements never change - maybe it is overkill. But I would say SQLCLR is an under-used feature and a little promotion for other similar problems is in order. If you are after productivity of maintainability then I would still prefer testing-debugging-changing SQLCLR over TSQL. YMMV. – Imre Pühvel Aug 03 '17 at 06:07
0

If you have a copy of NGrams8K handy this is a piece of cake. Both solutions below are simple, require no CLR integration, and will blow the doors off of any regex alternative with respect to performance.

-- sample data
DECLARE @table table 
(
  CompanyName varchar(100) 
  unique /* included to simulate an index on this column */
);
INSERT @table values('HSBC Inc'),('Barcccclays'),('AAAAA'),('Testtttt'),('Tesco Plc');

-- solution #1
SELECT companyname
FROM @table
CROSS APPLY dbo.ngrams8K(CompanyName,1)
WHERE charindex(replicate(token,4), CompanyName) > 0
GROUP BY CompanyName;

-- solution #2 (if you want to return a boolean)
SELECT companyname, has4ConsecChars = max(sign(charindex(replicate(token,4), CompanyName)))
FROM @table
CROSS APPLY dbo.ngrams8K(CompanyName,1)
GROUP BY CompanyName;

Results

companyname
---------------
AAAAA
Barcccclays
Testtttt   

companyname     has4ConsecChars
--------------- ---------------
AAAAA           1
Barcccclays     1
HSBC Inc        0
Tesco Plc       0
Testtttt        1
Alan Burstein
  • 7,770
  • 1
  • 15
  • 18