5

I need to search a text (around 500 words long) for words in an English dictionary (around 275,000 keywords) to detect non-English words and right now the query I am using is not really optimized which takes more than 10 seconds to execute (there's a words table and a texts table):

SELECT word FROM words WHERE 'The quick brown fox jumps over the lazy dog' LIKE CONCAT( '%', word, '%' );

Got the idea from here.

I have already set the word field as an Index and seen some examples of people storing the text in the database or putting it directly in the query.

Other examples showed people using FULLTEXT search although having 300k words I don't think a FULLTEXT will work, I guess it's good to search with logic +brown +lazy -apple but in my case I don't need much logic.

Another example I've seen is to concate words with the IN (...) clause although having 500m keywords the query would just be insanely long.

Any ideas what to do?

Right now the text is saved as a text field and the words as varchar(50) in InnoDB with utf8_unicode_ci encoding, I've heard InnoDB is slow so I could use MyISAM or any other. I am using MySQL 5.5 although I could update to 5.6 if that helped.

Community
  • 1
  • 1
Kenzier
  • 53
  • 5

2 Answers2

2

LIKE comparisons are basically just wildcard-capable equality tests. They are not a generic keyword search engine.

WHERE foo LIKE '%a b%' would find any records that contain the literal text a b anywhere in the foo field, they don't look for a or b separately, a b is a single monolithic "word" and that word is searched for in its entirety.

If you want to search for multiple "words" using LIKE, you have to do

WHERE foo LIKE '%a%' OR foo LIKE '%b%' OR etc...

which quickly gets ugly, and extremely inefficient - %... search cannot use indexes.

You'd be better off switching to a fulltext search system instead, where you can have the far simpler

WHERE MATCH(foo) AGAINST ('a b')
Marc B
  • 356,200
  • 43
  • 426
  • 500
  • Thanks Marc, the problem with MATCH AGAINST is that I'd need 275k OR statements which would make for a really long query. Do you mean I should do `MATCH(text) AGAINST(word)` where `word` is the field from MySQL? – Kenzier Sep 14 '15 at 19:15
  • no. it's `match(field1, field2,....) against ('your text')`. that text can be one word, or a bajillion words. – Marc B Sep 14 '15 at 19:16
  • Oh I see, got parameters mixed up. Yet the query will still be extremely long, like 50MB if stored in a file and require me to first retrieve all keywords and concatenate them. Would that be a problem? – Kenzier Sep 14 '15 at 19:21
  • I am reading the documentation right now and can pass the `word` field as a parameter. Thank you – Kenzier Sep 14 '15 at 19:24
  • query length limit is governed by mysql's `max_allowed_packet` setting. anything bigger than that will get truncated. – Marc B Sep 14 '15 at 19:27
  • I'm upgrading to MySQL 5.6 to do fulltext searches the way you suggest. Thanks a lot Marc! – Kenzier Sep 14 '15 at 19:47
0

InnoDB can get pretty slow when you start getting into the millions of records. This is largely due to how it locks the rows when accessing the table.

I would use MyIsam so you can do a FULLTEXT search. Perhaps something like:

    select word from words where match(text) against(word)

I am not sure of the efficiency, but you do not really need to use the logic you were talking about, I don't think.

EDIT:

My code would really need to have more than one pass to work, as the second parameter really needs to be all of the words. You could use a FOR loop inside SQL to populate it, I suppose, but I will have to think about the code to do that. Probably a cursor, or stored procedure would do the trick.

I DO agree with the other answer, though, that you need to use a FULLTEXT search.

stubsthewizard
  • 352
  • 2
  • 12