Count distinct letters in a string in bigquery

Question

I have a string column in Biquery like:

select 'A' 
union all (select 'ab')
union all (select 'abc')
union all (select 'aa')
union all (select 'aab')

I would like to count the number of distinct characters in every row of the column, in this case the results would be:

Can this be done in BigQuery? How?

Does this answer your question? [Count number of unique characters in a string](https://stackoverflow.com/questions/29967280/count-number-of-unique-characters-in-a-string) — ggordon, Aug 04 '21 at 14:43
Not really, doesn't really solve the problem in a simple way, and it's not BigQuery, it's SQL — David Masip, Aug 04 '21 at 14:52

score 3 · Accepted Answer · answered Aug 04 '21 at 14:58

3

How about this (assuming you don't want to differentiate between uppercase and lowercase)...

WITH data AS (select 'A' AS `val`
              union all (select 'ab')
              union all (select 'abc')
              union all (select 'aa')
              union all (select 'aab'))
SELECT `val`, 26 - LENGTH(REGEXP_REPLACE('abcdefghijklmnopqrstuvwxyz', '['||LOWER(`val`)||']', ''))
FROM `data`;

answered Aug 04 '21 at 14:58

53epo

784
5
7

if you need to differentiate between uppercase and lowercase or 0-9, just update the source regex string and the length of this string. SELECT `val`, 52- LENGTH(REGEXP_REPLACE('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz', '['||`val`||']', '')) – Ka Boom Aug 04 '21 at 15:11

ggordon · Answer 2 · 2021-08-04T16:52:29.477

A simple approach is to use the SPLIT to convert your string to an array and UNNEST to convert the resulting array to a table. You may then use COUNT and DISTINCT to determine the number of unique characters as shown below:

with my_data AS (
   select 'A' as col
union all (select 'ab')
union all (select 'abc')
union all (select 'aa')
union all (select 'aab')
)

select col, (SELECT COUNT(*) FROM (
  SELECT DISTINCT element FROM UNNEST(SPLIT(col,'')) as element
)) n from my_data;

or simply

WITH my_data AS (
   SELECT 'A' as col UNION ALL 
   SELECT 'ab'       UNION ALL 
   SELECT 'abc'      UNION ALL 
   SELECT 'aa'       UNION ALL 
   SELECT 'aab'
)
SELECT 
    col, 
    (
        SELECT 
            COUNT(DISTINCT element) 
        FROM 
            UNNEST(SPLIT(col,'')) as element
    ) cnt 
FROM 
    my_data;

Timogavk · Answer 3 · 2021-08-04T16:45:30.880

1

Like previous but using COUNT with DISTINCT

with my_data AS (
select 'A' as col
union all (select 'ab')
union all (select 'abc')
union all (select 'aa')
union all (select 'aab')
)

select col, COUNT(DISTINCT element) FROM 
my_data,UNNEST(SPLIT(col,'')) as element
GROUP BY col

edited Aug 04 '21 at 16:45

answered Aug 04 '21 at 16:39

Timogavk

809
1
7
20

score 1 · Answer 4 · answered Aug 05 '21 at 08:20

If the data is not quite huge, I would rather go with the user-defined functions to ease up the string manipulation across different columns

CREATE TEMP FUNCTION
  get_unique_char_count(x STRING)
  RETURNS INT64
  LANGUAGE js AS r"""
    str_split = new Set(x.split(""));
    return str_split.size;
  """;
WITH
  result AS (
  SELECT
    'A' AS val
  UNION ALL (
    SELECT
      'ab')
  UNION ALL (
    SELECT
      'abc')
  UNION ALL (
    SELECT
      'aa')
  UNION ALL (
    SELECT
      'aab') )
SELECT
  val,
  get_unique_char_count(val) unique_char_count
FROM
  result

RESULT:

Count distinct letters in a string in bigquery

4 Answers4