Count how many times distinct words appear in a column Oracle 12c SQL

Question

How can you count how many times all the distinct words in a column appear

Below is an example and expected output

+--------+------------------------------+
| PERIOD |            STRING            |
+--------+------------------------------+
|        |                              |
| 1      | this is some text            |
|        |                              |
| 2      | more text                    |
|        |                              |
| 3      | this could be some more text |
+--------+------------------------------+

+-------+-------+
| WORD  | COUNT |
+-------+-------+
|       |       |
| this  | 2     |
|       |       |
| is    | 1     |
|       |       |
| some  | 2     |
|       |       |
| text  | 3     |
|       |       |
| more  | 2     |
|       |       |
| could | 1     |
|       |       |
| be    | 1     |
+-------+-------+

Thanks,

Do you want to do this in pure SQL, or can you use language like PL/SQL ? — SuperPoney, Nov 23 '20 at 13:17
btw, having capital letters vs. small letters such as `This` vs. `this` matter ? eg. are both equal or not? — Barbaros Özhan, Nov 23 '20 at 13:39
@BarbarosÖzhan Cases does not matter so both would be equal — Peachman1997, Nov 23 '20 at 13:40
Almost a duplicate of https://stackoverflow.com/q/38371989/1509264 just with an added `COUNT` step at the end. — MT0, Nov 23 '20 at 13:51

Barbaros Özhan · Answer 1 · 2020-11-23T13:49:02.683

You can use Hierarchical query such as

WITH t2 AS
(
 SELECT REGEXP_SUBSTR(LOWER(string),'[^[:space:]]+',1,level) AS word
   FROM t  
CONNECT BY level <= REGEXP_COUNT(LOWER(string),'[:space:]') + 1
    AND PRIOR SYS_GUID() IS NOT NULL
    AND PRIOR period = period
)    
SELECT word, COUNT(*) AS count
  FROM t2
 WHERE word IS NOT NULL
 GROUP BY word

Demo

P.S. LOWER() function is applied in order to get rid of problem related to case-sensitivity.

score 0 · Answer 2 · answered Nov 23 '20 at 13:32

The trick is splitting the string into words. One method uses a recursive CTE:

with words(word, string, n) as (
      select regexp_substr(string, '[^ ]+', 1, 1) as word, string, 1 as n
      from t
      union all
      select regexp_substr(string, '[^ ]+', 1, n + 1), string, n + 1
      from words
      where regexp_substr(string, '[^ ]+', 1, n + 1) is not null
     )
select word, count(*)
from words
group by word;

Here is a db<>fiddle.

score 0 · Answer 3 · answered Nov 23 '20 at 13:50

You can do it without (slow) regular expressions using simple string functions:

WITH word_bounds ( string, start_pos, end_pos ) AS (
  SELECT string,
         1,
         INSTR( string, ' ', 1 )
  FROM   table_name
UNION ALL
  SELECT string,
         end_pos + 1,
         INSTR( string, ' ', end_pos + 1 )
  FROM   word_bounds
  WHERE  end_pos > 0
),
words ( word ) AS (
SELECT CASE end_pos
       WHEN 0
       THEN SUBSTR( string, start_pos )
       ELSE SUBSTR( string, start_pos, end_pos - start_pos )
       END
FROM   word_bounds
)
SELECT word,
       COUNT(*) AS frequency
FROM   words
GROUP BY
       word
ORDER BY
       frequency desc, word;

Which, for the sample data:

CREATE TABLE table_name ( PERIOD, STRING ) AS
SELECT 1, 'this is some text' FROM DUAL UNION ALL
SELECT 2, 'more text' FROM DUAL UNION ALL
SELECT 3, 'this could be some more text' FROM DUAL

Outputs:

WORD  | FREQUENCY
:---- | --------:
text  |         3
more  |         2
some  |         2
this  |         2
be    |         1
could |         1
is    |         1

There is a discussion on the performance of different ways of splitting delimited strings here.

db<>fiddle here

Count how many times distinct words appear in a column Oracle 12c SQL

3 Answers3

Linked