I'm sure you're aware that putting comma-separated values into a single column of a single row is almost hopelessly bad database design. (If you're not sure about that, please read up on normalization.)
But almost hopeless isn't hopeless. You can pull out the individual items from the comma-separated strings, like this. The following expression retrieves the fourth item from your comma-separated list. If the list doesn't have four items, it retrieves an empty string.
SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',4),',',-1) w FROM tbl
You can read about SUBSTRING_INDEX()
here. The inner invocation retrieves everying up to the fourth comma from col2
. The outer invocation works backward from the end and retrieves everything up to the first comma. Behold your word. The extra commas stuck on the end of col2
keep the whole thing from going wrong for values of col2
with less than four commas.
Next, you need to use these expressions over and over to get the first, second, third, fourth, ... words from each row. You can put them in a UNION ALL query, like this.
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',1),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',2),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',3),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',4),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',5),',',-1) w FROM tbl
Notice how each expression in turn retrieves the first, second, third, etc word.
Next, we need to wrap all this in an outer query to clean it up and get rid of the zero-length "words". That looks like this: (http://sqlfiddle.com/#!9/1cd35/4/0)
SELECT col1, TRIM(w) w
FROM (
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',1),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',2),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',3),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',4),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',5),',',-1) w FROM tbl
) words
WHERE w <> ''
Finally, we'll do an aggregate on all this to do your counting. (http://sqlfiddle.com/#!9/1cd35/6/0)
SELECT COUNT(*) wordcount, TRIM(w) w
FROM (
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',1),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',2),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',3),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',4),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',5),',',-1) w FROM tbl
) words
WHERE w <> ''
GROUP BY TRIM(w)
ORDER BY COUNT(*) DESC
Please notice that you can use this sort of query to create a normalized version of your data, in a table called words
, like this:
CREATE TABLE words
SELECT col1, TRIM(w) word
FROM (
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',1),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',2),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',3),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',4),',',-1) w FROM tbl
UNION ALL
SELECT col1, SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(col2,',,,,,'),',',5),',',-1) w FROM tbl
) words
WHERE TRIM(w) <> ''
Then you can do
SELECT COUNT(*) wordcount, word
FROM words
GROUP BY word
ORDER BY COUNT(*) DESC, word
the easy way.
You probably noticed the query for retrieving your words is repetitive, with lots of separate SELECTs in the UNION ALL. There's one for every different possible number of comma-separated values in that column. Yeah. Almost hopeless sometimes gets a little nasty to handle. If you have some col2
values with ten thousand commas in them, you need to use a host program to rewrite your table in a normalized form.