0

I would like to count the occurrences of all words in a column. The tricky part is that words in a row can appear in long stretches; meaning there are many spaces in-between.

This is a dummy example:

column_name

aaa bbb ccc ddd
[aaa]
bbb
bbb

So far I managed to use the following code

SELECT column_name, 
    SUM(LEN(column_name) - LEN(REPLACE(column_name, ' ', ''))+1) as counts
FROM 
    dbo.my_own
GROUP BY 
    column_name

The code gives me smth like this

column_name         counts

aaa bbb ccc ddd     1
[aaa]               1
bbb                 2

However, my desired output is:

column_name         counts

aaa                 1
[aaa]               1
bbb                 3
ccc                 1
ddd                 1
Dario Seidl
  • 4,140
  • 1
  • 39
  • 55
Jo-Achna
  • 315
  • 1
  • 3
  • 14

2 Answers2

1

In SQL Server, you would use string_split():

select s.value as word, count(*)
from dbo.my_own o cross apply
     string_split(o.column_name, ' ') s
where s.value <> ''
group by s.value;

String manipulation is highly database-dependent. Most databases have some method for doing this, but they can be quite different.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • One more thing, I'm trying to order the number of words and used `group by word DESC`. Any variation of it is not working. I'm not sure why not? – Jo-Achna Apr 11 '20 at 13:13
0

First, take a look at this question to see how to split the words in your column into multiple rows. In that question the words are separated by comma, but, of course, it works the same with spaces.

For your case, assuming a table tablename with an id and your words in columnname, where you have at most 4 words in the column, it would look like this:

SELECT
    tablename.id,
    SUBSTRING_INDEX(SUBSTRING_INDEX(tablename.columnname, ' ', numbers.n), ' ', -1) columnname
FROM
    (SELECT 1 AS n UNION ALL
     SELECT 2 UNION ALL
     SELECT 3 UNION ALL
     SELECT 4) numbers
     INNER JOIN tablename
     ON LENGTH(tablename.columnname) - LENGTH(REPLACE(tablename.columnname, ' ', '')) >= numbers.n - 1
ORDER BY
    id, n

Then, you can simply count the words:

SELECT columnname, count(*) FROM (
    SELECT
        tablename.id,
        SUBSTRING_INDEX(SUBSTRING_INDEX(tablename.columnname, ' ', numbers.n), ' ', -1) columnname
    FROM
        (SELECT 1 AS n UNION ALL
         SELECT 2 UNION ALL
         SELECT 3 UNION ALL
         SELECT 4) numbers
         INNER JOIN tablename
         ON LENGTH(tablename.columnname) - LENGTH(REPLACE(tablename.columnname, ' ', '')) >= numbers.n - 1
    ORDER BY
        id, n
) normalized
GROUP BY columnname

If you have more than 4 words in your column, you need to expand the select from numbers accordingly.

Edit: Oh, I am late, and I assumed MySQL.

Dario Seidl
  • 4,140
  • 1
  • 39
  • 55
  • Thank you for the reply. Just out of curiosity. In your code, I have to know what is the maximum number of words (in dummy example I just gave 4), since my data base is large, I honestly did not know how many "unique" words I have there. Do you think, there is a way of not giving it manually the way you did and also run the code? – Jo-Achna Apr 11 '20 at 12:29
  • For this approach you need some source of the numbers 1 to the maximum of words in your column for the `substring_index`. You can use some other method to generate a large sequence, for example: https://stackoverflow.com/a/39918568/401712 – Dario Seidl Apr 11 '20 at 15:32