For the basic ASCII letters A-Z (as mentioned), and assuming a (typical) UTF-8 or LATIN* encoding (or most others):
SELECT chr(c) AS letter
, sum(octet_length(col)
- octet_length(translate(col, chr(c), ''))) AS total_count
FROM generate_series (ascii('A'), ascii('Z')) c
CROSS JOIN tbl
GROUP BY 1;
translate() works for single-character replacements and is a bit faster than replace(), which you would use when looking for multi-character strings.
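For illustration, a minimal sketch of the replace() variant counting occurrences of a multi-character substring (the substring 'ab' is just an example):

SELECT sum((octet_length(col)
          - octet_length(replace(col, 'ab', ''))) / octet_length('ab')) AS total_count
FROM   tbl;  -- divide by the substring's byte length: each occurrence removes that many bytes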
In (typical) UTF-8 or LATIN* encoding, basic ASCII letters are represented with a single byte. This allows the faster function octet_length(). To count characters encoded with more bytes, use length() instead, which counts characters instead of bytes.
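For example, a sketch counting a multi-byte character ('ä' chosen arbitrarily) with length() instead of octet_length():

SELECT sum(length(col) - length(translate(col, 'ä', ''))) AS total_count
FROM   tbl;  -- length() counts characters, so the 2-byte UTF-8 'ä' is handled correctly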
Also, we can conveniently generate a range of letters like A-Z with generate_series(), because their byte representation lines up in a continuous range in the mentioned encodings. Convert to integer with ascii() and back with chr().
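To see the round trip, this generates the 26 codes and letters (in the mentioned encodings, 'A'-'Z' map to 65-90):

SELECT c AS code, chr(c) AS letter
FROM   generate_series(ascii('A'), ascii('Z')) c;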
Then CROSS JOIN to your table (tbl), measure the difference between the original length and the length after removing the letter of interest, and sum.
But when counting many of the characters in your strings, this alternative approach is probably much faster:
SELECT letter, count(*) AS total_count
FROM tbl, unnest(string_to_array(col, NULL)) letter
WHERE ascii(letter) BETWEEN ascii('A') AND ascii('Z')
GROUP BY 1;
To count case-insensitively, throw in lower() or upper():
FROM tbl, unnest(string_to_array(upper(col), NULL)) letter
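Assembled into a complete query (same as above, just with upper() applied):

SELECT letter, count(*) AS total_count
FROM   tbl, unnest(string_to_array(upper(col), NULL)) letter
WHERE  ascii(letter) BETWEEN ascii('A') AND ascii('Z')
GROUP  BY 1;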
To check for multiple non-continuous ranges of characters:
WHERE letter ~ '^[a-zA-Z]$' -- a-z and A-Z separately (case-sensitive)
Or an arbitrary selection of characters (this works because the single character letter serves as a trivial regex pattern matched against the string constant):
WHERE 'abcXYZ' ~ letter
string_to_array() with separator NULL splits the string into an array of single characters. unnest() (using an implicit CROSS JOIN LATERAL) produces one row per character. The WHERE clause filters the characters of interest (again using their byte representation to make it fast). Then simply count per character.
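A minimal demonstration of the split (with an arbitrary example string):

SELECT unnest(string_to_array('Foo', NULL)) AS letter;
-- returns 3 rows: 'F', 'o', 'o'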