For the basic ASCII letters A-Z (as mentioned), and assuming a (typical) UTF-8 or LATIN* encoding (or most others):
SELECT chr(c) AS letter
, sum(octet_length(col)
- octet_length(translate(col, chr(c), ''))) AS total_count
FROM generate_series (ascii('A'), ascii('Z')) c
CROSS JOIN tbl
GROUP BY 1;
translate() works for single-character replacements and is a bit faster than replace(), which you would use when looking for multi-character strings.
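For illustration, a minimal sketch of the replace() variant counting occurrences of a multi-character substring (the substring 'ab' is just an example):

SELECT sum((octet_length(col)
          - octet_length(replace(col, 'ab', ''))) / octet_length('ab')) AS total_count
FROM   tbl;  -- divide by the substring's byte length: each occurrence removes that many bytes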
In (typical) UTF-8 or LATIN* encoding, basic ASCII letters are represented with a single byte. This allows the faster function octet_length(). To count characters encoded with more bytes, use length() instead, which counts characters instead of bytes.
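For example, a sketch counting a multi-byte character ('ä' chosen arbitrarily) with length() instead of octet_length():

SELECT sum(length(col) - length(translate(col, 'ä', ''))) AS total_count
FROM   tbl;  -- length() counts characters, so the 2-byte UTF-8 'ä' is handled correctly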
Also, we can conveniently generate a range of letters like A-Z with generate_series(), because their byte representation lines up in a continuous range in the mentioned encodings. Convert to integer with ascii() and back with chr().
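To see the round trip, this generates the 26 codes and letters (in the mentioned encodings, 'A'-'Z' map to 65-90):

SELECT c AS code, chr(c) AS letter
FROM   generate_series(ascii('A'), ascii('Z')) c;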
Then CROSS JOIN to your table (tbl), measure the difference between the original length and the length after removing the letter of interest, and sum.
But when counting many of the characters in your strings, this alternative approach is probably much faster:
SELECT letter, count(*) AS total_count
FROM tbl, unnest(string_to_array(col, NULL)) letter
WHERE ascii(letter) BETWEEN ascii('A') AND ascii('Z')
GROUP BY 1;
To count case-insensitively, throw in lower() or upper():
FROM tbl, unnest(string_to_array(upper(col), NULL)) letter
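Assembled into a complete query (same as above, just with upper() applied):

SELECT letter, count(*) AS total_count
FROM   tbl, unnest(string_to_array(upper(col), NULL)) letter
WHERE  ascii(letter) BETWEEN ascii('A') AND ascii('Z')
GROUP  BY 1;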
To check for multiple non-continuous ranges of characters:
WHERE letter ~ '^[a-zA-Z]$' -- a-z and A-Z separately (case-sensitive)
Or an arbitrary selection of characters (this works because the single character letter serves as a trivial regex pattern matched against the string constant):
WHERE 'abcXYZ' ~ letter
string_to_array() with separator NULL splits the string into an array of single characters. unnest() (using an implicit CROSS JOIN LATERAL) produces one row per character. The WHERE clause filters the characters of interest (again using their byte representation to make it fast). Then simply count per character.
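A minimal demonstration of the split (with an arbitrary example string):

SELECT unnest(string_to_array('Foo', NULL)) AS letter;
-- returns 3 rows: 'F', 'o', 'o'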