12

In a MySQL 5.6 database I have table tablename which has (including others) three TEXT columns: col_a, col_b, col_c.

I want to extract all unique words (with words being separated by spaces) from these three columns that are at least 5 characters long. By "word" I mean any string of non-space characters, eg "foo-123" would be a word, as would "099423". The columns are all utf8 format InnoDB columns.

Is there a single query to do this?

EDIT: As requested, here's an example: (in the real data col_a, col_b and col_c are TEXT fields and could have a large number of words.)

select id, col_a, col_b, col_c from tablename;

id  | col_a              | col_b          | col_c
----|--------------------|----------------|----------------------
1   | apple orange plum  | red green blue | bill dave sue
2   | orange plum banana | yellow red     | frank james
3   | kiwi fruit apple   | green pink     | bill sarah-jane frank

expected_result: ["apple", "orange", "banana", "fruit", 
                  "green", "yellow", "frank", "james", "sarah-jane"]

I don't care about the order of results. thanks!

EDIT: in my example above, everything is in lowercase, as that's how I happen to store everything in my real-life table that this question relates to. But, for the sake of argument, if it did contain some capitalisation I would prefer the query to ignore capitalisation (this is the setting of my DB config as it happens).

EDIT2: in case it helps, all of the text columns have a FULLTEXT index on them.

EDIT3: here is the SQL to create the sample data:

DROP TABLE IF EXISTS `tablename`;
CREATE TABLE `tablename` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `col_a` text,
  `col_b` text,
  `col_c` text,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=utf8;
LOCK TABLES `tablename` WRITE;
INSERT INTO `tablename` VALUES (1,'apple orange plum','red green blue','bill dave sue'),(2,'orange plum banana','yellow red','frank james'),(3,'kiwi fruit apple','green pink','bill sarah-jane frank');
UNLOCK TABLES;
vrintle
  • 5,501
  • 2
  • 16
  • 46
Max Williams
  • 32,435
  • 31
  • 130
  • 197
  • 3
    Show us some sample table data and the expected result - all as formatted text (no images.) And take a look at https://stackoverflow.com/help/reprex – jarlh May 16 '19 at 10:30
  • have added, thanks. – Max Williams May 16 '19 at 10:54
  • 1
    What version of MySQL? – Nick May 16 '19 at 10:55
  • 5.6 - i've added this to the question body too. – Max Williams May 16 '19 at 10:57
  • 1
    @MaxWilliams Maximum number of words in a particular value is 3 ? or it can be anything ? – Madhur Bhaiya Jul 02 '19 at 09:43
  • There can be any number of words in a value, thanks. I'll add this as an edit to the question. – Max Williams Jul 02 '19 at 10:09
  • 1
    The simplest and the fastest way to achieve that is to use [CTE](https://dev.mysql.com/doc/refman/8.0/en/with.html) - Common Table Expressions, which enables you to "scan" cols and rows recursively. I do not have MySQL. Are you interested in MS SQL Server example? – Maciej Los Jul 02 '19 at 10:10
  • @MaciejLos any solution needs to be compatible with MySQL 5.6, so an MS SQL server example wouldn't be useful. However, if there's a CTE (i've not encountered these) approach for MySQL I would be interested to hear it, thanks. – Max Williams Jul 02 '19 at 10:27
  • Sorry, but MySQL prior to 8.0 doesn't support recursive queries ;( See: [How do you use the “WITH” clause in MySQL?](https://stackoverflow.com/questions/1382573/how-do-you-use-the-with-clause-in-mysql) A "working example" for MySQL 8.0 you'll find here: [DBFillde](https://www.db-fiddle.com/f/kR3bMFjbUeGp7X5JPNor6c/1) – Maciej Los Jul 02 '19 at 10:53
  • 2
    Your task is better done in a real programming language, not the limited SQL language. – Rick James Jul 02 '19 at 16:41
  • @RickJames that had occurred to me, and in fact was the first thing I tried - i'm doing this in a Ruby on Rails app and i loaded all of the data into ruby as strings, split them into words, and uniq-ified them. The fastest I could get it was about 17 seconds, for my largest table. There is scope for optimising the ruby code, I think, but often (not always), if you can figure it out, a raw SQL query is the fastest way to do something, which is why I asked this question. I may well end up doing most of the processing in ruby. – Max Williams Jul 03 '19 at 08:25
  • How many rows and what content length (avg number of characters in each column) are we talking about? – Salman A Jul 05 '19 at 14:59
  • This is one of the many reasons that I dislike MySQL. It lacks many features. In PostgreSQL you could simply use `DISTINCT UNNEST(STRING_TO_ARRAY(column))`. I did get a query working that does not require a function/procedure to be created and does not have a series of `UNION JOIN`s. It uses a cross join to create a sequence and then `SUBSTRING_INDEX` the string. See below. – derek.wolfe Jul 05 '19 at 19:21
  • @salmanA for the largest table, around 100,000 rows and an average of 340 chars per row, spread across 8 columns. – Max Williams Jul 08 '19 at 13:13

11 Answers11

2

The best solution is not using that structure to store data and normalize your database in compliance with normal forms. But if you want to split strings to words and get them as a table and you can't normalize the database and you can't use the latest version of MYSQL with CTE you could create a simple stored procedure to split strings and store them to a temporary table. For example, the stored procedure might look like:

DELIMITER //
CREATE PROCEDURE split_string_to_table (str longtext)
BEGIN
  DECLARE val TEXT DEFAULT NULL;
  DROP TEMPORARY TABLE IF EXISTS temp_values;
  CREATE TEMPORARY TABLE temp_values (
     `value` varchar(200)  
  );

  iterator:
  LOOP  
    IF LENGTH(TRIM(str)) = 0 OR str IS NULL THEN
      LEAVE iterator;
    END IF;
    SET val = SUBSTRING_INDEX(str, ' ', 1);
    INSERT INTO temp_values (`value`) VALUES (TRIM(val));
    SET str = INSERT(str, 1, LENGTH(val) + 1, '');
  END LOOP;
  SELECT DISTINCT(`value`) FROM temp_values WHERE CHAR_LENGTH(`value`) >= 5;
END //
DELIMITER ;

After it, you can join all strings to one string and store it in a temporary variable and pass its value to the stored procedure:

SELECT CONCAT_WS(' ', 
                 GROUP_CONCAT(col_a SEPARATOR ' '), 
                 GROUP_CONCAT(col_b SEPARATOR ' '), 
                 GROUP_CONCAT(col_c SEPARATOR ' ')
       ) INTO @text
FROM mytable;

CALL split_string_to_table(@text);

Result:

--------------
| value      |
--------------
| apple      |
--------------
| orange     |
--------------
| banana     |
--------------
| fruit      |
--------------
| green      |
--------------
| yellow     |
--------------
| frank      |
--------------
| james      |
--------------
| sarah-jane |
--------------

You can see the demo of that realization in DBFiddle

Maksym Fedorov
  • 6,383
  • 2
  • 11
  • 31
1

Using a SELECT inside another SELECT, UNION SELECT and SUBSTRING_INDEX function managed to come up with the following result

SELECT DISTINCT results.col_a as "values"

FROM(

    SELECT DISTINCT      
    SUBSTRING_INDEX(SUBSTRING_INDEX(tablename.col_a, ' ', numbers.n), ' ', -1) col_a
    FROM (SELECT 1 n UNION ALL SELECT 2
    UNION ALL SELECT 3 UNION ALL SELECT 4) numbers INNER JOIN tablename
    ON CHAR_LENGTH(tablename.col_a)-CHAR_LENGTH(REPLACE(tablename.col_a, ' ', ''))>=numbers.n-1

    UNION DISTINCT
    SELECT DISTINCT
    SUBSTRING_INDEX(SUBSTRING_INDEX(tablename.col_b, ' ', numbers.n), ' ', -1) col_b
    FROM (SELECT 1 n UNION ALL SELECT 2
    UNION ALL SELECT 3 UNION ALL SELECT 4) numbers INNER JOIN tablename
    ON CHAR_LENGTH(tablename.col_b)-CHAR_LENGTH(REPLACE(tablename.col_b, ' ', ''))>=numbers.n-1

    UNION DISTINCT
    SELECT DISTINCT
    SUBSTRING_INDEX(SUBSTRING_INDEX(tablename.col_c, ' ', numbers.n), ' ', -1) col_c
    FROM (SELECT 1 n UNION ALL SELECT 2
    UNION ALL SELECT 3 UNION ALL SELECT 4) numbers INNER JOIN tablename
    ON CHAR_LENGTH(tablename.col_c)-CHAR_LENGTH(REPLACE(tablename.col_c, ' ', ''))>=numbers.n-1

) AS results

WHERE length(results.col_a) > 4

Result

+-----------+
|values     |
+-----------+
|apple      |
+-----------+
|banana     |
+-----------+
|frank      |
+-----------+
|fruit      |
+-----------+
|green      |
+-----------+
|james      |
+-----------+
|orange     |
+-----------+
|sarah-jane |
+-----------+
|yellow     |
+-----------+
Binara Medawatta
  • 512
  • 1
  • 9
  • 28
  • This appears to work, but it's **incredibly slow**, with the real data - like "Kill it after 30 minutes, I didn't even let it finish" slow. I didn't specify that it needs to be not incredibly slow, so if no-one comes up with a faster solution I will mark this correct, thanks. – Max Williams Jul 02 '19 at 13:59
  • Thanks for the comment, I'll try to come up with a faster solution – Binara Medawatta Jul 02 '19 at 14:47
  • 2
    You must remove `tablename,` from the `FROM` clause, currently you get a CROSS JOIN. – dnoeth Jul 04 '19 at 09:53
1

How about exporting the data into a file and then import it into a new table?

select col_a, col_b, col_c
    into outfile '/tmp/words.csv'
    fields terminated by ' ' escaped by '' 
    lines  terminated by ' '
    from tablename;

create table tmp_words(word varchar(50));

load data infile '/tmp/words.csv'
    into table tmp_words
    lines terminated by ' ';

select distinct word from tmp_words where char_length(word) >= 5;

drop table tmp_words;

Result:

word
----------
apple
orange
green
banana
yellow
frank
james
fruit
sarah-jane

Since you have a FULLTEXT INDEX, you could also just read the words from the information_schema:

set global innodb_ft_aux_table = 'test/tablename';

select WORD
from information_schema.INNODB_FT_INDEX_TABLE
where char_length(WORD) >= 5
union 
select WORD
from information_schema.INNODB_FT_INDEX_CACHE
where char_length(WORD) >= 5

However - Due to how the FULLTEXT INDEX works, "words" like "sarah-jane" will be splitted. You can see that in the result:

WORD
------
apple
banana
frank
fruit
green
james
orange
sarah   <-- !
yellow

db-fiddle

You will also miss stopwords like "about".

See: InnoDB INFORMATION_SCHEMA FULLTEXT Index Tables

Paul Spiegel
  • 30,925
  • 5
  • 44
  • 53
1

Shell script might be efficient...

  1. SELECT CONCAT_WS(' ', col_a, col_b, col_c) INTO OUTFILE 'x' ... to get the columns into a file
  2. tr ' ' "\n" <x -- split into one word per line
  3. awk 'length($1) >= 5' -- minimum size of 5 characters per word
  4. sort -u -- to dedup

There are no stopwords, but sed or awk could deal with that.

 mysql -e "SELECT ... INTO OUTFILE 'x' ..." ...
 tr ' ' "\n" <x  |  awk 'length($1) >= 5'  |  sort -u
Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Thanks - this is a promising approach, but `uniq` doesn't seem to be uniqifying the lines. This is what I'm doing: `dbname='my_database'; fname='/tmp/tablename_strings.txt'; sudo chown max:max $fname; rm $fname; mysql -u root $dbname -e "SELECT CONCAT_WS(' ', col_a, col_b, col_c) FROM tablename INTO OUTFILE '$fname'"; tr ' ' '\n' <"$fname" | awk 'length($1) >= 5' | uniq` and I get this (separated by newlines): `apple orange green orange banana yellow frank james fruit apple green sarah-jane frank ` – Max Williams Jul 04 '19 at 09:27
  • 1
    @MaxWilliams - Oops, I forgot the `sort`. (And there may be a de-dup option on `sort`, which would avoid the need for `uniq`.) – Rick James Jul 04 '19 at 14:58
  • 1
    You could use `sort -u` to combine the sort and uniq steps. – Bill Karwin Jul 07 '19 at 13:28
  • I've given this this tick (and bounty) as it's the fastest of the many suggestions. Thanks! – Max Williams Jul 08 '19 at 13:21
  • Sometimes it helps to "think out of the box". (In this case the box is MySQL.) – Rick James Jul 08 '19 at 13:31
1

Here is my solution. Instead of calculating for each column and doing UNION, I have concatenated all columns first using CONCAT_WS. Then get the distinct value and apply the conditions you want. This way you can neglect the union and increase the performance.

SELECT MYWORD FROM (
SELECT
  DISTINCT SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT_WS(' ', COL_A, COL_B, COL_C), ' ', N.DIGIT+1), ' ', -1)  MYWORD 
FROM
  MYTABLE 
  INNER JOIN
  (SELECT 0 DIGIT UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3  UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6) N
  ON (LENGTH(REPLACE(CONCAT_WS(' ', COL_A, COL_B, COL_C), ' ' , '')) <= LENGTH(CONCAT_WS(' ', COL_A, COL_B, COL_C))-N.DIGIT) ) MYRESULT
  WHERE LENGTH(MYRESULT.MYWORD)>=5

OUTPUT : DBFIDDLE

Arun Palanisamy
  • 5,281
  • 6
  • 28
  • 53
1

From your performance requirements and comments, it appears that you need to run this query regularly. Unfortunately, your data just isn't at the right resolution to do this neatly or succinctly


I would consider adding a summary table of sorts to assist with the final query. By maintaining the summary table, as and when data in the main table changes, you should be able to keep things simpler

A suggested format for this summary table would be

  • summary_table - id, main_table_id, column_name, word

Where main_table_id is a foreign key to your main table's id column

You could also place a composite unique index on (main_table_id, column_name, word)


On editing a relevant column value in the main table, you should adjust the summary table

  • Remove existing words for the main_table_id and column_name
  • Insert a new list of unique words, of at least 5 characters, for the main_table_id and column_name

This could either be done at the application level or using a trigger


This would make the final query much simpler..

SELECT DISTINCT word
  FROM summary_table
Arth
  • 12,789
  • 5
  • 37
  • 69
  • 1
    Thanks. Funnily enough the tables in question are **already** "summary" tables of sorts, being the data that I want to index for a full text search on a corresponding other table. What I wanted the unique words for is use in a suggestions algorithm, like *"We couldn't find any results matching 'applx', did you mean 'apple'?"*. I think that the best approach for this is to generate the word list offline (in a scheduled job, say), and store it in a seperate table, as you suggest. – Max Williams Jul 04 '19 at 13:41
  • 1
    @MaxWilliams If they are **already** summary tables, I'd suggest they aren't summarising enough! It's a bit suspicious that you have words all neatly separated by spaces, could you not split the words up into separate rows as these summary tables are created? – Arth Jul 04 '19 at 13:46
  • I don't want to change the schema I have already, really - this word-matching aspect is like an extra layer on top. – Max Williams Jul 04 '19 at 14:01
  • 1
    I may well add this as an extra layer, though, so thanks for the push in that direction. – Max Williams Jul 08 '19 at 13:29
1

EDIT: Posting a new solution here after noticing you are using MySQL 5.6.

Use of a few variables will allow you to not need a function/procedure or weird UNION joins. This uses a cross join on the current table to generate a sequence that allows us to split the string by a delimiter.

Try this after your create statement:

SET @row = 0;
SET @list = (SELECT GROUP_CONCAT(CONCAT_WS(' ', col_a, col_b, col_c) SEPARATOR ' ') FROM tablename);
SET @limiter = (SELECT LENGTH(@list) - LENGTH(REPLACE(@list, ' ', '')) + 1);

SELECT DISTINCT word 
FROM (
    SELECT RTRIM(
               SUBSTRING_INDEX(
                   SUBSTRING_INDEX(
                       @list,
                       ' ',
                       r
                   ),
                   ' ',
                   -1
               )
           ) AS word
    FROM (SELECT @row := @row + 1 AS r
          FROM tablename t1, tablename t2, tablename t3) gen_seq
    WHERE r <= @limiter
) words WHERE LENGTH(word) >= 5;

Depending on the size of your table, you may be able to remove tablename t3 from the cross join to speed up the query. Since the table was small, 3 iterations of the cross join was necessary.


If you are using at least MySQL 8, recursion is an option.

I took the table you created and ran the following on it:

SET @list = (SELECT GROUP_CONCAT(CONCAT_WS(' ', col_a, col_b, col_c) SEPARATOR ' ') 
             FROM tablename);

WITH RECURSIVE words AS (
    (SELECT 1 AS n, @list AS words)
    UNION 
    (SELECT n+1 AS m, @list
     FROM words 
     WHERE n < (LENGTH(words) - LENGTH(REPLACE(words,' ', ''))) + 1
    )
)
SELECT DISTINCT LTRIM(
           SUBSTRING(
               SUBSTRING_INDEX(words, ' ', n),
               CHAR_LENGTH(
                   SUBSTRING_INDEX(words, ' ', n-1)
               ) + 1
           )
       ) word 
FROM words
WHERE n <= (LENGTH(words) - LENGTH(REPLACE(words,' ', ''))) + 1
HAVING LENGTH(word) >= 5;

Both of these options will give this result:

  1. apple
  2. orange
  3. green
  4. banana
  5. yellow
  6. frank
  7. james
  8. fruit
  9. sarah-jane

EDIT: Leaving the MySQL 8 option in case it is useful to someone in the future.

derek.wolfe
  • 1,086
  • 6
  • 11
1

As fond as I am of SQL solutions, this is a case that doesn't fit well. SQL wants to treat each column as an atomic value, and your design of storing a list of words in a single TEXT column works against that principle. It's equivalent to using a comma-separated list.

The solutions for this task in pure SQL are all complex, and that should be a red flag that it's the wrong approach. If you use a code solution that's at the limit of your understanding, it'll be too hard to debug or maintain.

You mentioned in the comments that you were open to a Ruby solution. I tested this. The advantage is that the code is much more clear.

require 'mysql2'

client = Mysql2::Client.new(:host => "localhost", :database => "test", :username => "root")

words = {}
client.query("SELECT LOWER(CONCAT_WS(' ', col_a, col_b, col_c)) AS words FROM tablename").each do |row|
  row["words"].split(' ').each do |word|
    if word.length >= 5
      words[word] = true
    end
  end
end

print words.keys

Output:

["apple", "orange", "green", "banana", "yellow", "frank", "james", "fruit", "sarah-jane"]

I would actually store the words individually, not in a list. I know you said you don't want to change your schema, but that's necessary if you want this to run both efficiently and with a simpler solution.

CREATE TABLE words (
  id SERIAL PRIMARY KEY,
  word_length SMALLINT NOT NULL,
  word VARCHAR(191) NOT NULL,
  KEY (word_length)
);

mysql> SELECT DISTINCT word FROM words WHERE word_length >= 5;
+------------+
| word       |
+------------+
| apple      |
| orange     |
| green      |
| banana     |
| yellow     |
| frank      |
| james      |
| fruit      |
| sarah-jane |
+------------+
Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
  • Thanks. I actually ended up doing this myself, using the sql code from @RickJames' answer above, and it is the fastest that i've found. Pure SQL solutions seem to be cripplingly slow. – Max Williams Jul 08 '19 at 13:28
0

Assumption: Words are separated by a single space character only (not multiple). It will get much more complicated if there are multiple spaces.

  1. We will have to use a number generator sequence. As per your case, I have only considered a sequence from 1 to 6. You can obviously consider a bigger sequence. Look for ways to generate them here: https://dba.stackexchange.com/questions/75785/how-to-generate-a-sequence-in-mysql
  2. Using multiple string operations, you can convert a space-separated string to rows. In WHERE, we will specify the character length constraint utilizing CHAR_LENGTH function. We can then use UNION to combine results for col_a, col_b and col_c separately.

View on DB Fiddle

Query #1

(
SELECT 
  SUBSTRING_INDEX(SUBSTRING_INDEX(t.col_a, ' ', ngen.num), ' ', -1) AS word 
FROM
  tablename AS t 
  INNER JOIN
  (SELECT 1 num UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4  UNION ALL SELECT 5 UNION ALL SELECT 6) AS ngen
    ON LENGTH(REPLACE(t.col_a, ' ' , '')) <= LENGTH(t.col_a)-ngen.num+1
WHERE CHAR_LENGTH(SUBSTRING_INDEX(SUBSTRING_INDEX(t.col_a, ' ', ngen.num), ' ', -1)) >= 5
)

UNION 

(
SELECT 
  SUBSTRING_INDEX(SUBSTRING_INDEX(t.col_b, ' ', ngen.num), ' ', -1) AS word 
FROM
  tablename AS t 
  INNER JOIN
  (SELECT 1 num UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4  UNION ALL SELECT 5 UNION ALL SELECT 6) AS ngen
    ON LENGTH(REPLACE(t.col_b, ' ' , '')) <= LENGTH(t.col_b)-ngen.num+1
WHERE CHAR_LENGTH(SUBSTRING_INDEX(SUBSTRING_INDEX(t.col_b, ' ', ngen.num), ' ', -1)) >= 5
)

UNION 

(
SELECT 
  SUBSTRING_INDEX(SUBSTRING_INDEX(t.col_c, ' ', ngen.num), ' ', -1) AS word 
FROM
  tablename AS t 
  INNER JOIN
  (SELECT 1 num UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4  UNION ALL SELECT 5 UNION ALL SELECT 6) AS ngen
    ON LENGTH(REPLACE(t.col_c, ' ' , '')) <= LENGTH(t.col_c)-ngen.num+1
WHERE CHAR_LENGTH(SUBSTRING_INDEX(SUBSTRING_INDEX(t.col_c, ' ', ngen.num), ' ', -1)) >= 5
);

Result:

| word       |
| ---------- |
| apple      |
| orange     |
| fruit      |
| banana     |
| yellow     |
| green      |
| frank      |
| james      |
| sarah-jane |
Madhur Bhaiya
  • 28,155
  • 10
  • 49
  • 57
0

Method

I'd recommend using a regular expression replace function to do this. This gives greater flexibility if the requirements were to ever change, e.g. if words might be separated by multiple spaces or other types of whitespace such as tab characters or even to potentially handle punctuation such as commas and full stops. Given you've stated MySQL v5.6, the newer REGEXP_REPLACE function wouldn't be available - but a few years ago I wrote a custom-built regular expression function to fill the gap. Here it is in action...

Demo

Rextester online demo: https://rextester.com/DCJE11797

SQL

SELECT DISTINCT word
FROM 
(SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(words, '¬', digits.idx + 1), '¬', -1) word
 FROM
 (SELECT reg_replace(LOWER(TRIM(CONCAT(col_a, ' ', col_b, ' ', col_c))),
                     '[[:space:]]+',
                     '¬',
                     TRUE,
                     1,
                     0) AS words
  FROM table_name) delimited
 INNER JOIN
 (SELECT @row := @row + 1 as idx FROM 
  (SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5
            UNION ALL SELECT 6 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8
            UNION ALL SELECT 9) t1,
  (SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5
            UNION ALL SELECT 6 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8
            UNION ALL SELECT 9) t2, 
  (SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5
            UNION ALL SELECT 6 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8
            UNION ALL SELECT 9) t3, 
  (SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5
            UNION ALL SELECT 6 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8
            UNION ALL SELECT 9) t4, 
  (SELECT @row := -1) t5) digits
 ON LENGTH(REPLACE(words, '¬' , '')) <= LENGTH(words) - digits.idx) subq
WHERE CHAR_LENGTH(word) >= 5

Output

word
1   apple
2   banana
3   frank
4   fruit
5   green
6   james
7   orange
8   sarah-jane
9   yellow

Explanation

A few tricks are used in the SQL above and some accreditation is needed:

  1. The columns are concatenated with a space added between each, trimmed to remove leading/trailing spaces and converted to lower case: LOWER(TRIM(CONCAT(col_a, ' ', col_b, ' ', col_c))
  2. The regular expression replacer is then used to replace all continuous blocks of whitespace - each being replaced by a single ¬ character: reg_replace(str, '[[:space:]]+', '¬', TRUE, 1, 0). Note: A different character could be chosen instead if there is any possibility of this character appearing in the words.
  3. The technique from this answer is used for transforming a string with delimited values into separate row values. It's combined with the clever technique from this answer for generating a table consisting of a sequence of incrementing numbers: 0 - 10,000 in this case.
  4. The use of DISTINCT guarantees that no word is repeated in the output (could use a GROUP BY instead if you want to count occurrences). And optionally an ORDER BY could be used to order the words alphabetically - but it sounds like you may wish to remove this to speed things up.
Steve Chambers
  • 37,270
  • 24
  • 156
  • 208
0

If your data always are two or three parts, you can use this simple and fast way:

CREATE TEMPORARY TABLE temp1
SELECT SUBSTRING_INDEX(p.col_a, ' ', 1) col1 FROM table1 p
UNION
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(p.col_a,' ',2),' ',-1) col1 FROM table1 p
UNION
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(p.col_a,' ',3),' ',-1) col1 FROM table1 p
UNION
SELECT SUBSTRING_INDEX(p.col_b, ' ', 1) col1 FROM table1 p
UNION
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(p.col_b,' ',2),' ',-1) col1 FROM table1 p
UNION
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(p.col_b,' ',3),' ',-1) col1 FROM table1 p
UNION
SELECT SUBSTRING_INDEX(p.col_c, ' ', 1) col1 FROM table1 p
UNION
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(p.col_c,' ',2),' ',-1) col1 FROM table1 p
UNION
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(p.col_c,' ',3),' ',-1) col1 FROM table1 p;

SELECT DISTINCT col1 FROM temp1
WHERE CHAR_LENGTH(col1) >= 5
Saeid Amini
  • 1,313
  • 5
  • 16
  • 26