Count the number of rows that contain a letter/number

Question

What I am trying to achieve is straightforward, but it is a little difficult to explain and I don't know if it is even possible in Postgres. I am at a fairly basic level: SELECT, FROM, WHERE, LEFT JOIN ON, HAVING, etc.

I am trying to count the number of rows that contain a particular letter/number and display that count against the letter/number.

i.e. how many rows have entries that contain an "a/A" (case insensitive)?

The table I'm querying is a list of film names. All I want to do is group and count 'a-z' and '0-9' and output the totals. I could run 36 queries sequentially:

SELECT filmname FROM films WHERE filmname ilike '%a%'
SELECT filmname FROM films WHERE filmname ilike '%b%'
SELECT filmname FROM films WHERE filmname ilike '%c%'

And then run pg_num_rows on the result to find the number I require, and so on.

I know how intensive LIKE is, and ILIKE even more so, so I would prefer to avoid them. Although the data (below) contains both upper and lower case, I want the results to be case insensitive, i.e. for "The Men Who Stare At Goats" the a/A, t/T and s/S wouldn't count twice in the result set. I can copy the table to a secondary working table with all the data lowercased (strtolower) and run the query against that if it makes the query simpler or easier to construct.

An alternative could be something like

SELECT sum(length(regexp_replace(filmname, '[^Xx]', '', 'g'))) FROM films;

for each letter combination, but that is again 36 queries and 36 datasets; I would prefer to get the data in a single query.

Here is a short data set of 14 films from my set (which actually contains 275 rows)

District 9
Surrogates
The Invention Of Lying
Pandorum
UP
The Soloist
Cloudy With A Chance Of Meatballs
The Imaginarium of Doctor Parnassus
Cirque du Freak: The Vampires Assistant
Zombieland
9
The Men Who Stare At Goats
A Christmas Carol
Paranormal Activity

If I manually lay out each letter and number in a column and then register if that letter appears in the film title by giving it an x in that column and then count them up to produce a total I would have something like this below. Each vertical column of x's is a list of the letters in that filmname regardless of how many times that letter appears or its case.

The result for the short set above is:

A  x x  xxxx xxx  9 
B       x  x      2 
C x     xxx   xx  6
D x  x  xxxx      6
E  xx  xxxxx x    8
F   x   xxx       4 
G  xx    x   x    4
H   x  xxxx  xx   7
I x x  xxxxx  xx  9
J                 0
K         x       1
L   x  xx  x  xx  6
M    x  xxxx xxx  8
N   xx  xxxx x x  8
O  xxx xxx x xxx  10
P    xx  xx    x  5
Q         x       1
R xx x   xx  xxx  7
S xx   xxxx  xx   8
T xxx  xxxx  xxx  10
U  x xx xxx       6
V   x     x    x  3
W       x    x    2
X                 0 
Y   x   x      x  3
Z          x      1 
0                 0  
1                 0  
2                 0 
3                 0
4                 0
5                 0
6                 0
7                 0
8                 0
9 x         x     2

In the example above, each column is a "filmname". As you can see, column 5 marks only a "u" and a "p" and column 11 marks only a "9". The final column is the tally for each letter.

I want to build a query that gives me the result rows A 9, B 2, C 6, D 6, E 8, etc., taking into account every row in my films column. If a letter doesn't appear in any row I would like a zero.

I don't know if this is even possible, or whether doing it systematically in PHP with 36 queries is the only possibility.

In the current dataset there are 275 entries and it grows by around 8.33 a month (100 a year). I predict it will reach around 1000 rows by 2019 by which time I will be no doubt using a completely different system so I don't need to worry about working with a huge dataset to trawl through.

The current longest title is "Percy Jackson & the Olympians: The Lightning Thief" at 50 chars (yes, poor film I know ;-) and the shortest is 1, "9".

I am running version 9.0.0 of Postgres.

Apologies if I've said the same thing multiple times in multiple ways, I am trying to get as much information out so you know what I am trying to achieve.

If you need any clarification or larger datasets to test with please just ask and I'll edit as needs be.

Suggestions are very welcome.

Edit 1

Erwin, thanks for the edits/tags/suggestions. I agree with them all.

Fixed the missing "9" typo as suggested by Erwin. Manual transcribe error on my part.

kgrittn, Thanks for the suggestion but I am not able to update the version from 9.0.0. I have asked my provider if they will try to update.

Response

Thanks for the excellent reply Erwin

Apologies for the delay in responding but I have been trying to get your query to work and learning the new keywords to understand the query you created.

I adjusted the query to adapt into my table structure but the result set was not as expected (all zeros) so I copied your lines directly and had the same result.

In both cases the result set lists all 36 rows with the appropriate letters/numbers, but every row shows zero as the count (ct).

I have tried to deconstruct the query to see where it may be falling over.

The result of

SELECT DISTINCT id, unnest(string_to_array(lower(film), NULL)) AS letter
FROM  films

is "No rows found". Perhaps it ought to when extracted from the wider query, I'm not sure.

When I removed the unnest function the result was 14 rows all with "NULL"

If I adjust the function

COALESCE(y.ct, 0) to COALESCE(y.ct, 4)

then every letter in my result shows 4 instead of the zeros described previously.

Having briefly read up on COALESCE, where the "4" is the substitute value, I am guessing that y.ct is NULL and is being replaced with this second value. (Is this to cover rows where the letter in the sequence is not matched, i.e. if no films contain a 'q' then the 'q' row will show zero rather than NULL?)
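That reading of COALESCE can be checked in isolation. A minimal sketch with literal values that are not tied to the films table (the column aliases are only illustrative):

```sql
-- COALESCE returns its first non-NULL argument.
SELECT COALESCE(NULL::bigint, 0) AS no_match,   -- 0: the NULL is replaced
       COALESCE(3::bigint, 0)    AS has_match;  -- 3: the real count is kept
```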

The database I tried this on was SQL_ASCII and I wondered if that was somehow a problem but I had the same result on one running version 8.4.0 with UTF-8.

Apologies if I've made an obvious mistake but I am unable to return the dataset I require.

Any thoughts?

Again, thanks for the detailed response and your explanations.

Your result table is obviously wrong, '9' occurs in two films, but only one is marked with an 'x'. — Erwin Brandstetter, May 10 '12 at 16:35
@George: Please read this page and then update to a more recent bug-fix level of 9.0: http://www.postgresql.org/support/versioning/ — kgrittn, May 10 '12 at 18:34

Erwin Brandstetter · Accepted Answer · 2012-05-14T09:32:44.530

This query should do the job:

Test case:

CREATE TEMP TABLE films (id serial, film text);
INSERT INTO films (film) VALUES
 ('District 9')
,('Surrogates')
,('The Invention Of Lying')
,('Pandorum')
,('UP')
,('The Soloist')
,('Cloudy With A Chance Of Meatballs')
,('The Imaginarium of Doctor Parnassus')
,('Cirque du Freak: The Vampires Assistant')
,('Zombieland')
,('9')
,('The Men Who Stare At Goats')
,('A Christmas Carol')
,('Paranormal Activity');

Query:

SELECT l.letter, COALESCE(y.ct, 0) AS ct
FROM  (
    SELECT chr(generate_series(97, 122)) AS letter  -- a-z in UTF8!
    UNION ALL
    SELECT generate_series(0, 9)::text              -- 0-9
    ) l
LEFT JOIN (
    SELECT letter, count(id) AS ct
    FROM  (
        SELECT DISTINCT  -- count film once per letter
               id, unnest(string_to_array(lower(film), NULL)) AS letter
        FROM   films
        ) x
    GROUP  BY 1
    ) y  USING (letter)
ORDER  BY 1;

This requires PostgreSQL 9.1! Consider the release notes:

Change string_to_array() so a NULL separator splits the string into characters (Pavel Stehule)

Previously this returned a null value.
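
A quick way to see which behavior a given server has is sketched below. The literal 'abc' is only an example value. When the array is NULL, unnest() produces no rows, which would explain a result where every count falls back to the COALESCE default.

```sql
-- 9.1 or later: is_null is false, because the result is the array {a,b,c}.
-- Before 9.1:   is_null is true, because string_to_array() returns NULL here.
SELECT string_to_array('abc', NULL) IS NULL AS is_null;
```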

You can use regexp_split_to_table(lower(film), '') instead of unnest(string_to_array(lower(film), NULL)) (this works in versions before 9.1), but it is typically a bit slower and performance degrades with long strings.
I use generate_series() to produce [a-z0-9] as individual rows, and LEFT JOIN the counts to them, so every letter is represented in the result.
Use DISTINCT to count every film once.
Never worry about 1000 rows. That is peanuts for modern day PostgreSQL on modern day hardware.
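
For servers older than 9.1, the alternative mentioned above might look like this when written out in full. This is a sketch against the test table from this answer (films with columns id and film); only the innermost SELECT differs from the query above.

```sql
SELECT l.letter, COALESCE(y.ct, 0) AS ct
FROM  (
    SELECT chr(generate_series(97, 122)) AS letter  -- a-z
    UNION ALL
    SELECT generate_series(0, 9)::text              -- 0-9
    ) l
LEFT JOIN (
    SELECT letter, count(id) AS ct
    FROM  (
        SELECT DISTINCT  -- count film once per letter
               id, regexp_split_to_table(lower(film), '') AS letter
        FROM   films
        ) x
    GROUP  BY 1
    ) y  USING (letter)
ORDER  BY 1;
```

An empty pattern makes regexp_split_to_table() split between every character, so each film yields one row per character before DISTINCT removes the repeats.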

You may also be interested in the extension [unaccent](http://www.postgresql.org/docs/current/interactive/unaccent.html). — Erwin Brandstetter, May 10 '12 at 22:27
Fortunately my tables in this example will never have any data with non-(a-z0-9) characters, but nevertheless thank you for pointing this extension out. I wasn't aware of it and I can see it being useful in the future. Thanks. — George, May 12 '12 at 02:17
Unfortunately as explained in my edit/response to the question above, this returns rows with a count of all zeros for me. Any ideas? — George, May 13 '12 at 09:46
@George: My apologies. I had missed that my solution requires PostgreSQL 9.1. The alternative with regexp_split_to_table() works with older versions and is only slightly slower. I amended my answer accordingly. — Erwin Brandstetter, May 14 '12 at 09:35
Again, thank you. I will try to get my install updated to 9.1 or higher. In the meantime I will adjust to the regexp_split_to_table solution. Cheers! Prost! — George, May 14 '12 at 10:30

Eelke · Answer 2 · 2012-05-10T16:34:17.617

A fairly simple solution which only requires a single table scan would be the following.

SELECT 
    'a', SUM( (title ILIKE '%a%')::integer),
    'b', SUM( (title ILIKE '%b%')::integer),
    'c', SUM( (title ILIKE '%c%')::integer)
FROM film

I left the other 33 characters as a typing exercise for you :)

BTW 1000 rows is tiny for a PostgreSQL database. It only begins to get large when the DB is larger than the memory in your server.

edit: had a better idea

SELECT chars.c, COUNT(title)
FROM (VALUES ('a'), ('b'), ('c')) as chars(c)
    LEFT JOIN film ON title ILIKE ('%' || chars.c || '%')
GROUP BY chars.c
ORDER BY chars.c

You could also replace the (VALUES ('a'), ('b'), ('c')) as chars(c) part with a reference to a table containing the list of characters you are interested in.
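
As a sketch of that idea without a separate table, the full list of 36 characters could be generated inline with generate_series(). This assumes the same film table and title column used in this answer; the ct alias is only illustrative.

```sql
SELECT chars.c, COUNT(title) AS ct
FROM  (
    SELECT chr(generate_series(97, 122)) AS c   -- a-z
    UNION ALL
    SELECT generate_series(0, 9)::text          -- 0-9
    ) AS chars
    LEFT JOIN film ON title ILIKE ('%' || chars.c || '%')
GROUP BY chars.c
ORDER BY chars.c;
```

Because COUNT(title) ignores the NULLs produced by the LEFT JOIN, characters that appear in no title come out as 0.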

score 0 · Answer 3 · answered May 10 '12 at 16:45

This will give you the result in a single row, with one column for each matching letter and digit.

SELECT
  -- POSITION is case sensitive, so compare against the lowercased title
  SUM(CASE WHEN POSITION('a' IN lower(filmname)) > 0 THEN 1 ELSE 0 END) AS "A",
  SUM(CASE WHEN POSITION('b' IN lower(filmname)) > 0 THEN 1 ELSE 0 END) AS "B",
  SUM(CASE WHEN POSITION('c' IN lower(filmname)) > 0 THEN 1 ELSE 0 END) AS "C",
  ...
  SUM(CASE WHEN POSITION('z' IN lower(filmname)) > 0 THEN 1 ELSE 0 END) AS "Z",
  SUM(CASE WHEN POSITION('0' IN lower(filmname)) > 0 THEN 1 ELSE 0 END) AS "0",
  SUM(CASE WHEN POSITION('1' IN lower(filmname)) > 0 THEN 1 ELSE 0 END) AS "1",
  ...
  SUM(CASE WHEN POSITION('9' IN lower(filmname)) > 0 THEN 1 ELSE 0 END) AS "9"
FROM films;

user unknown · Answer 4 · 2012-05-12T02:28:47.733

An approach similar to Erwin's, but maybe more comfortable in the long run:

Create a table with each character you're interested in:

CREATE TABLE char (name char (1), id serial);
INSERT INTO char (name) VALUES ('a');
INSERT INTO char (name) VALUES ('b');
INSERT INTO char (name) VALUES ('c');

Then grouping over its values is easy:

SELECT char.name, COUNT(*) 
  FROM char, film 
  WHERE film.name ILIKE '%' || char.name || '%' 
  GROUP BY char.name 
  ORDER BY char.name;

Don't worry about ILIKE.
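
Note that the query above returns no row at all for a character that matches no film. If zeros are wanted for those characters, as the question asks, one possible variation is a LEFT JOIN. This is a sketch that assumes the same char and film tables and their name columns; the aliases c, f and ct are only illustrative.

```sql
SELECT c.name, COUNT(f.name) AS ct   -- COUNT(f.name) skips unmatched rows
  FROM char AS c
  LEFT JOIN film AS f ON f.name ILIKE '%' || c.name || '%'
  GROUP BY c.name
  ORDER BY c.name;
```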

I'm not 100% happy about using the keyword 'char' as the table name, but I haven't had bad experiences so far. On the other hand, it is the natural name. Maybe if you translate it to another language, like 'zeichen' in German, you avoid ambiguities.
