0

I have a string column in Biquery like:

select 'A' 
union all (select 'ab')
union all (select 'abc')
union all (select 'aa')
union all (select 'aab')

I would like to count the number of distinct characters in every row of the column, in this case the results would be:

1
2
3
1
2

Can this be done in BigQuery? How?

David Masip
  • 2,146
  • 1
  • 26
  • 46
  • Does this answer your question? [Count number of unique characters in a string](https://stackoverflow.com/questions/29967280/count-number-of-unique-characters-in-a-string) – ggordon Aug 04 '21 at 14:43
  • Not really, doesn't really solve the problem in a simple way, and it's not BigQuery, it's SQL – David Masip Aug 04 '21 at 14:52

4 Answers4

3

How about this (assuming you don't want to differentiate between uppercase and lowercase)...

WITH data AS (select 'A' AS `val`
              union all (select 'ab')
              union all (select 'abc')
              union all (select 'aa')
              union all (select 'aab'))
SELECT `val`, 26 - LENGTH(REGEXP_REPLACE('abcdefghijklmnopqrstuvwxyz', '['||LOWER(`val`)||']', ''))
FROM `data`;
53epo
  • 784
  • 5
  • 7
  • if you need to differentiate between uppercase and lowercase or 0-9, just update the source regex string and the length of this string. SELECT `val`, 52- LENGTH(REGEXP_REPLACE('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz', '['||`val`||']', '')) – Ka Boom Aug 04 '21 at 15:11
3

A simple approach is to use the SPLIT to convert your string to an array and UNNEST to convert the resulting array to a table. You may then use COUNT and DISTINCT to determine the number of unique characters as shown below:

with my_data AS (
   select 'A' as col
union all (select 'ab')
union all (select 'abc')
union all (select 'aa')
union all (select 'aab')
)

select col, (SELECT COUNT(*) FROM (
  SELECT DISTINCT element FROM UNNEST(SPLIT(col,'')) as element
)) n from my_data;

or simply

WITH my_data AS (
   SELECT 'A' as col UNION ALL 
   SELECT 'ab'       UNION ALL 
   SELECT 'abc'      UNION ALL 
   SELECT 'aa'       UNION ALL 
   SELECT 'aab'
)
SELECT 
    col, 
    (
        SELECT 
            COUNT(DISTINCT element) 
        FROM 
            UNNEST(SPLIT(col,'')) as element
    ) cnt 
FROM 
    my_data;
ggordon
  • 9,790
  • 2
  • 14
  • 27
1

Like previous but using COUNT with DISTINCT

with my_data AS (
select 'A' as col
union all (select 'ab')
union all (select 'abc')
union all (select 'aa')
union all (select 'aab')
)

select col, COUNT(DISTINCT element) FROM 
my_data,UNNEST(SPLIT(col,'')) as element
GROUP BY col
Timogavk
  • 809
  • 1
  • 7
  • 20
1

If the data is not quite huge, I would rather go with the user-defined functions to ease up the string manipulation across different columns

CREATE TEMP FUNCTION
  get_unique_char_count(x STRING)
  RETURNS INT64
  LANGUAGE js AS r"""
    str_split = new Set(x.split(""));
    return str_split.size;
  """;
WITH
  result AS (
  SELECT
    'A' AS val
  UNION ALL (
    SELECT
      'ab')
  UNION ALL (
    SELECT
      'abc')
  UNION ALL (
    SELECT
      'aa')
  UNION ALL (
    SELECT
      'aab') )
SELECT
  val,
  get_unique_char_count(val) unique_char_count
FROM
  result

RESULT:

enter image description here

Logan
  • 1,331
  • 3
  • 18
  • 41