0

I have a dictionary in an MYSQL table, the table consists of 240 000 words. If I for example have the letters G, I, G, S, N and O I would like to select all the words in the table that contain all or some of these letters (and no other letters).

Acceptable words would for example include:

  • going
  • song
  • son
  • so
  • on

Examples of unacceptable words:

  • songs (one S more than allowed)
  • words longer that the number of characters (6 in this case)

What would the MYSQL query look like?

My current MYSQL looks like:

SELECT * FROM `list` 
WHERE word like '%S%' and word like '%O%' and word like '%G%'

I want to use 6 or 7 letters and find words that are:

  • equally long
  • shorter

Now I only find words that are equally long or longer and that contain other letters as well.

Ankur Sinha
  • 6,473
  • 7
  • 42
  • 73
  • Please edit your question to include a [mcve], see https://meta.stackoverflow.com/questions/333952/why-should-i-provide-a-minimal-reproducible-example-for-a-very-simple-sql-query. Also add the SQL queries you have tried and explain why they didn't return the result you want. – Progman Jul 20 '19 at 15:43
  • Can you change the layout of the tables to create new columns? Can you even add a new table for additional information? This problem is loosely related to "saving comma separated values", even though no commas are involved here. As seen on https://stackoverflow.com/questions/3653462/is-storing-a-delimited-list-in-a-database-column-really-that-bad it is very difficult to handle such kind of data, as you already have noticed. – Progman Jul 20 '19 at 15:53
  • I don't understand what you want with "6 or 7 ..." -- give some examples, and I may further enhance my Answer. – Rick James Jul 22 '19 at 03:54

1 Answers1

1

This is a starting point:

(I will insist that you construct the query from the letters you desire.)

If the column has only one word:

WHERE word REGEXP '^[GISNO]+$'

If the column can have multiple words, this will pick the row (but not the word), then before version 8.0:

WHERE word REGEXP '[[:<:]][GISNO]+[[:>:]]'

Or, with 8.0:

WHERE word REGEXP '\b[GISNO]+b'

Now to filter out "too many" of each letter. (I will assume the word is by itself in the column.)

  AND word NOT REGEXP 'G.*G.*G'   -- max of 2 Gs
  AND word NOT REGEXP 'I.*I'      -- max of 1 I
  AND word NOT REGEXP 'O.*O'      -- max of 1 O
  AND word NOT REGEXP 'S.*S'      -- max of 1 S

Another approach involves building an extra column with the letters alphabetized.

going   ggino
song    gnos
son     nos
so      os
on      no
no      no   -- note the dup in the new column

Now the test becomes

WHERE sorted_word REGEXP '^g{0,2}i?n?o?s?$'

This should run somewhat faster.

And some other things may run faster using this trick.

Rick James
  • 135,179
  • 13
  • 127
  • 222