-2

I have a table with a varchar(255) field. I want to get the number of occurrences of each word in a group of rows from this table. My Table:

col1    col2    col3
123    A,B,C    Some Value
345      A      Some Value
567      B      Some Value
876    B,C      Some Value
890    C,A      Some Value
231    A,C      Some Value
456     B       Some Value
678     C       Some Value

I want Output Should be

Alpha   Count
A       4
B       4
C       5
Raging Bull
  • 18,593
  • 13
  • 50
  • 55
ansh
  • 573
  • 3
  • 9
  • 26
  • 4
    This is bad database design. You should normalise it. – trincot Jul 22 '17 at 12:18
  • As @trincot says this is a bad design, however as a possible way round I would look at first splitting the col with multiple values in rows, ref https://stackoverflow.com/questions/17942508/sql-split-values-to-multiple-rows then it would be a case of grouping the result set by the column with the split out words and counting. – Valeklosse Jul 22 '17 at 12:44

1 Answers1

0

I'm sure you're aware that putting comma-separated values into a single column of a single row is almost hopelessly bad database design. (If you're not sure about that, please read up on normalization.)

But almost hopeless isn't hopeless. You can pull out the individual items from the comma-separated strings, like this. The following expression retrieves the fourth item from your comma-separated list. If the list doesn't have four items, it retrieves an empty string.

SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',4),',',-1) w FROM tbl

You can read about SUBSTRING_INDEX() here. The inner invocation retrieves everying up to the fourth comma from col2. The outer invocation works backward from the end and retrieves everything up to the first comma. Behold your word. The extra commas stuck on the end of col2 keep the whole thing from going wrong for values of col2 with less than four commas.

Next, you need to use these expressions over and over to get the first, second, third, fourth, ... words from each row. You can put them in a UNION ALL query, like this.

SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',1),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',2),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',3),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',4),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',5),',',-1) w FROM tbl

Notice how each expression in turn retrieves the first, second, third, etc word.

Next, we need to wrap all this in an outer query to clean it up and get rid of the zero-length "words". That looks like this: (http://sqlfiddle.com/#!9/1cd35/4/0)

SELECT col1, TRIM(w) w
FROM (
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',1),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',2),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',3),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',4),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',5),',',-1) w FROM tbl
) words
WHERE w <> ''

Finally, we'll do an aggregate on all this to do your counting. (http://sqlfiddle.com/#!9/1cd35/6/0)

SELECT COUNT(*) wordcount, TRIM(w) w
FROM (
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',1),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',2),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',3),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',4),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',5),',',-1) w FROM tbl
) words
WHERE w <> ''
GROUP BY TRIM(w)
ORDER BY COUNT(*) DESC

Please notice that you can use this sort of query to create a normalized version of your data, in a table called words, like this:

CREATE TABLE words
SELECT col1, TRIM(w) word
FROM (
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',1),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',2),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',3),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',4),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',5),',',-1) w FROM tbl
) words
WHERE TRIM(w) <> ''

Then you can do

   SELECT COUNT(*) wordcount, word 
    FROM words 
   GROUP BY word
   ORDER BY COUNT(*) DESC, word

the easy way.


You probably noticed the query for retrieving your words is repetitive, with lots of separate SELECTs in the UNION ALL. There's one for every different possible number of comma-separated values in that column. Yeah. Almost hopeless sometimes gets a little nasty to handle. If you have some col2 values with ten thousand commas in them, you need to use a host program to rewrite your table in a normalized form.

O. Jones
  • 103,626
  • 17
  • 118
  • 172