
Say I have the following data set

Column1 (VarChar(50 or something))
Elias
Sails
Pails
Plane
Games

What I'd like to produce from this column is the following set:

LETTER     COUNT
E          3
L          4
I          3
A          5
S          5
And So On...

One solution I thought of was combining all the strings into a single string and then counting each instance of each letter in that string, but that feels sloppy.

This is more an exercise in curiosity than anything else, but is there a way to get a count of all distinct letters in a dataset with SQL?
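For reference, a minimal sketch of that concatenation idea (assuming SQL Server 2017+ for STRING_AGG, and a hypothetical table name t), which also shows why it feels sloppy: the expression would have to be repeated once per letter.

SELECT LEN(agg.s) - LEN(REPLACE(agg.s, 'a', '')) AS a_count -- repeat per letter
FROM (SELECT STRING_AGG(Column1, '') AS s FROM t) AS agg;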

Elias

5 Answers


I would do this by creating a table of your letters similar to:

CREATE TABLE tblLetter
(
  letter varchar(1)
);

INSERT INTO tblLetter ([letter])
VALUES
    ('a'),
    ('b'),
    ('c'),
    ('d'); -- etc

Then you could join the letter table to your data, matching the rows whose value contains each letter:

select l.letter, count(n.col) Total
from tblLetter l
inner join names n
  on n.col like '%'+l.letter+'%'
group by l.letter;

See SQL Fiddle with Demo. This would give a result:

| LETTER | TOTAL |
|--------|-------|
|      a |     5 |
|      e |     3 |
|      g |     1 |
|      i |     3 |
|      l |     4 |
|      m |     1 |
|      p |     2 |
|      s |     4 |
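Note that because LIKE only tests whether a letter appears at all, this counts the rows containing each letter rather than the total occurrences, which is why s shows 4 here but 5 in the question's expected output (Sails contains two). If total occurrences are wanted, a variant reusing the same tblLetter with the length-difference trick from the next answer might look like this (assuming the default case-insensitive collation, as in the demo):

select l.letter, sum(len(n.col) - len(replace(n.col, l.letter, ''))) Total
from tblLetter l
inner join names n
  on n.col like '%'+l.letter+'%'
group by l.letter;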
Taryn

If you create a table of letters, like this:

create table letter (ch char(1));
insert into letter(ch) values ('A'),('B'),('C'),('D'),('E'),('F'),('G'),('H')
,('I'),('J'),('K'),('L'),('M'),('N'),('O'),('P')
,('Q'),('R'),('S'),('T'),('U'),('V'),('W'),('X'),('Y'),('Z');

you could do it with a cross join, like this:

select ch, SUM(len(str) - len(replace(str,ch,''))) as frequency
from letter
cross join test -- <<== test is the name of the table with the strings
group by ch
having SUM(len(str) - len(replace(str,ch,''))) <> 0

Here is a running demo on sqlfiddle.

You can do it without defining a table by embedding a list of letters into the query itself; the idea of cross joining and grouping by the letter remains the same.
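For example, a derived table built with the VALUES row constructor (SQL Server 2008+) would look something like this; the letter list is truncated here for brevity:

select v.ch, SUM(len(str) - len(replace(str,v.ch,''))) as frequency
from (values ('A'),('B'),('C'),('D') /* ... through ('Z') */) v(ch)
cross join test
group by v.ch
having SUM(len(str) - len(replace(str,v.ch,''))) <> 0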

Note: see this answer for the explanation of the expression inside the SUM.
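In short: replacing a letter with the empty string shortens the string by exactly one character per occurrence, so the length difference is the occurrence count. A quick sanity check, assuming the default case-insensitive collation:

-- replace('Sails','s','') removes both 'S' and 's', leaving 'ail',
-- so the result is len('Sails') - len('ail') = 5 - 3 = 2
select len('Sails') - len(replace('Sails','s','')) as s_count;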

Sergey Kalinichenko
  • But why would you filter out non-relevant results only after the grouping and not before? `... WHERE len(str) - len(replace(str,ch,'')) <> 0 GROUP BY ch` might prove faster. – Andriy M Jun 24 '14 at 12:49
  • @AndriyM That shouldn't make much difference: in either case, the expression will be evaluated once per word/character pair to go into the `SUM`; that's where the real expense is. Filtering out zeros after `GROUP BY` is very cheap. In fact, it might prove to be cheaper than a `WHERE` clause, because this results in fewer zero checks (precisely 26, instead of `26 * wordCount`). – Sergey Kalinichenko Jun 24 '14 at 12:54

To me, this is a problem almost tailored for a recursive CTE (thanks to Nicholas Carey for the original; my fiddle is here: http://sqlfiddle.com/#!3/44f77/8):

WITH cteLetters
AS
(
    SELECT
              1 AS CharPos,
              str,
              MAX(LEN(str)) AS MaxLen,
              SUBSTRING(str, 1, 1) AS Letter
    FROM
              test
    GROUP BY
              str,
              SUBSTRING(str, 1, 1)

    UNION ALL

    SELECT
              CharPos + 1,
              str,
              MaxLen,
              SUBSTRING(str, CharPos + 1, 1) AS Letter
    FROM
              cteLetters
    WHERE
              CharPos + 1 <= MaxLen
)

SELECT
          UPPER(Letter) AS Letter,
          COUNT(*) CountOfLetters
FROM
          cteLetters
GROUP BY
          Letter
ORDER BY
          Letter;

Use the CTE to calculate character positions and deconstruct each string. Then you can just aggregate from the CTE itself. No need for additional tables or anything.
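One caveat worth hedging: SQL Server caps recursive CTEs at 100 recursion levels by default, so values longer than about 100 characters would raise an error. A MAXRECURSION hint on the outer query lifts the cap:

SELECT
          UPPER(Letter) AS Letter,
          COUNT(*) CountOfLetters
FROM
          cteLetters
GROUP BY
          Letter
ORDER BY
          Letter
OPTION (MAXRECURSION 0); -- 0 removes the default 100-level limit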

VBlades
  • Hi, I'm trying your solution but I'm still not sure where the characters come from when you query cteLetters. I tried your sqlfiddle example but it doesn't seem to work – noor h Jan 05 '20 at 10:53
  • @noorh Hi, the letters are coming from table test, which is set up in the build portion. Check sqlfiddle link, now updated to use SQL Server 2017 engine (code is identical, though): http://sqlfiddle.com/#!18/0ccb8/1 – VBlades Jan 09 '20 at 00:06

This should work even if you have case sensitivity turned on, since both sides of the comparison are upper-cased.

The setup:

CREATE TABLE _test ( Column1 VARCHAR (50) )

INSERT _test (Column1) VALUES ('Elias'),('Sails'),('Pails'),('Plane'),('Games')

The work:

DECLARE @counter AS INT
DECLARE @results TABLE (LETTER VARCHAR(1),[COUNT] INT)

SET @counter=65  --ascii value for 'A'

WHILE ( @counter <=90 )  -- ascii value for 'Z'
BEGIN
    INSERT @results (LETTER,[COUNT])
    SELECT CHAR(@counter),SUM(LEN(UPPER(Column1)) - LEN(REPLACE(UPPER(Column1), CHAR(@counter),''))) FROM _test
    SET @counter=@counter+1
END

SELECT * FROM @results WHERE [Count]>0
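For what it's worth, the loop above scans _test 26 times, once per letter. A set-based sketch that does the same work in a single pass, assuming the undocumented master..spt_values table is available as a 0-25 number source:

SELECT CHAR(v.number + 65) AS LETTER,
       SUM(LEN(UPPER(Column1)) - LEN(REPLACE(UPPER(Column1), CHAR(v.number + 65), ''))) AS [COUNT]
FROM master..spt_values v  -- type 'P' rows are a contiguous 0..2047 number list
CROSS JOIN _test
WHERE v.type = 'P' AND v.number BETWEEN 0 AND 25
GROUP BY CHAR(v.number + 65)
HAVING SUM(LEN(UPPER(Column1)) - LEN(REPLACE(UPPER(Column1), CHAR(v.number + 65), ''))) > 0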

It's often useful to have a range or sequence table that gives you a source of large runs of contiguous sequential numbers, like this one covering the range -100,000 to +100,000.

drop table dbo.range
go
create table dbo.range
(
  id int not null primary key clustered
)
go

set nocount on
go

declare @i int = -100000
while ( @i <= +100000 )
begin
  if ( @i > 0 and @i % 1000 = 0 ) print convert(varchar,@i) + ' rows'
  insert dbo.range values ( @i )
  set @i = @i + 1
end
go

set nocount off
go
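As an aside, the row-by-row loop above is slow for 200,001 inserts. A set-based population, assuming SQL Server 2005+ for ROW_NUMBER() and that the cross join of sys.all_objects with itself yields enough rows (it does on any modern instance), could look like this:

;with n as
(
  select top (200001)
         row_number() over (order by (select null)) - 100001 as id
  from sys.all_objects a
  cross join sys.all_objects b
)
insert dbo.range ( id )
select id from n
go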

Once you have such a table, you can do something like this:

select character = substring( t.some_column , r.id , 1 ) ,
       frequency = count(*)
from dbo.some_table t
join dbo.range      r on r.id between 1 and len( t.some_column ) -- one row per character position
group by substring( t.some_column , r.id , 1 )
order by 1

If you want to ensure case-insensitivity, just mix in the desired upper() or lower():

select character = upper( substring( t.some_column , r.id , 1 ) ) ,
       frequency = count(*)
from dbo.some_table t
join dbo.range      r on r.id between 1 and len( t.some_column )
group by upper( substring( t.some_column , r.id , 1 ) )
order by 1

Given your sample data:

create table dbo.some_table
(
  some_column varchar(50) not null
)
go

insert dbo.some_table values ( 'Elias' )
insert dbo.some_table values ( 'Sails' )
insert dbo.some_table values ( 'Pails' )
insert dbo.some_table values ( 'Plane' )
insert dbo.some_table values ( 'Games' )
go

The latter query above produces the following results:

character  frequency
A          5
E          3
G          1
I          3
L          4
M          1
N          1
P          2
S          5
Nicholas Carey