0

How can you count how many times all the distinct words in a column appear

Below is an example and expected output

+--------+------------------------------+
| PERIOD |            STRING            |
+--------+------------------------------+
|        |                              |
| 1      | this is some text            |
|        |                              |
| 2      | more text                    |
|        |                              |
| 3      | this could be some more text |
+--------+------------------------------+

+-------+-------+
| WORD  | COUNT |
+-------+-------+
|       |       |
| this  | 2     |
|       |       |
| is    | 1     |
|       |       |
| some  | 2     |
|       |       |
| text  | 3     |
|       |       |
| more  | 2     |
|       |       |
| could | 1     |
|       |       |
| be    | 1     |
+-------+-------+

Thanks,

Barbaros Özhan
  • 59,113
  • 10
  • 31
  • 55
Peachman1997
  • 183
  • 1
  • 12

3 Answers3

1

You can use Hierarchical query such as

WITH t2 AS
(
 SELECT REGEXP_SUBSTR(LOWER(string),'[^[:space:]]+',1,level) AS word
   FROM t  
CONNECT BY level <= REGEXP_COUNT(LOWER(string),'[:space:]') + 1
    AND PRIOR SYS_GUID() IS NOT NULL
    AND PRIOR period = period
)    
SELECT word, COUNT(*) AS count
  FROM t2
 WHERE word IS NOT NULL
 GROUP BY word

Demo

P.S. LOWER() function is applied in order to get rid of problem related to case-sensitivity.

Barbaros Özhan
  • 59,113
  • 10
  • 31
  • 55
0

The trick is splitting the string into words. One method uses a recursive CTE:

with words(word, string, n) as (
      select regexp_substr(string, '[^ ]+', 1, 1) as word, string, 1 as n
      from t
      union all
      select regexp_substr(string, '[^ ]+', 1, n + 1), string, n + 1
      from words
      where regexp_substr(string, '[^ ]+', 1, n + 1) is not null
     )
select word, count(*)
from words
group by word;

Here is a db<>fiddle.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
0

You can do it without (slow) regular expressions using simple string functions:

WITH word_bounds ( string, start_pos, end_pos ) AS (
  SELECT string,
         1,
         INSTR( string, ' ', 1 )
  FROM   table_name
UNION ALL
  SELECT string,
         end_pos + 1,
         INSTR( string, ' ', end_pos + 1 )
  FROM   word_bounds
  WHERE  end_pos > 0
),
words ( word ) AS (
SELECT CASE end_pos
       WHEN 0
       THEN SUBSTR( string, start_pos )
       ELSE SUBSTR( string, start_pos, end_pos - start_pos )
       END
FROM   word_bounds
)
SELECT word,
       COUNT(*) AS frequency
FROM   words
GROUP BY
       word
ORDER BY
       frequency desc, word;

Which, for the sample data:

CREATE TABLE table_name ( PERIOD, STRING ) AS
SELECT 1, 'this is some text' FROM DUAL UNION ALL
SELECT 2, 'more text' FROM DUAL UNION ALL
SELECT 3, 'this could be some more text' FROM DUAL

Outputs:

WORD  | FREQUENCY
:---- | --------:
text  |         3
more  |         2
some  |         2
this  |         2
be    |         1
could |         1
is    |         1

There is a discussion on the performance of different ways of splitting delimited strings here.

db<>fiddle here

MT0
  • 143,790
  • 11
  • 59
  • 117