0

I have 2 tables:

Dictionary - Contains roughly 36,000 words

CREATE TABLE IF NOT EXISTS `dictionary` (
  `word` varchar(255) NOT NULL,
  PRIMARY KEY (`word`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Datas - Contains roughly 100,000 rows

CREATE TABLE IF NOT EXISTS `datas` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `hash` varchar(32) NOT NULL,
  `data` varchar(255) NOT NULL,
  `length` int(11) NOT NULL,
  `time` int(11) NOT NULL,
  PRIMARY KEY (`ID`),
  UNIQUE KEY `hash` (`hash`),
  KEY `data` (`data`),
  KEY `length` (`length`),
  KEY `time` (`time`)
) ENGINE=InnoDB  DEFAULT CHARSET=latin1 AUTO_INCREMENT=105316 ;

I would like to somehow select all the rows from datas where the column data contains 1 or more words.

I understand this is a big ask, it would need to match all of these rows together in every combination possible, so it needs the best optimization.

I have tried the below query, but it just hangs for ages:

SELECT      `datas`.*, `dictionary`.`word`
FROM        `datas`, `dictionary`
WHERE       `datas`.`data` LIKE CONCAT('%', `dictionary`.`word`, '%')
AND         LENGTH(`dictionary`.`word`) > 3
ORDER BY    `length` ASC
LIMIT       15

I have also tried something similar to the above with a left join, and on clause that specified the like statement.

Drahcir
  • 11,772
  • 24
  • 86
  • 128
  • @eggyal: Thanks, but I tried and didn't work. According to the answer on this post (http://stackoverflow.com/questions/10465758/subquery-incorrect-arguments-to-against-using-mysql), the `AGAINST` must be a literal string. – Drahcir Sep 15 '12 at 14:59
  • 2
    "even with the 15 row limit" is not very relevant for your query since the limiting can only take place after the result is determined and sorted. – Eddy Sep 15 '12 at 15:00
  • "1 or more words". Words from the dictionary table? – Jose Adrian Sep 15 '12 at 15:38
  • @JoseAdrian: Yes, from the dictionary table. – Drahcir Sep 15 '12 at 15:42
  • I think your query should work for what you want because you said to find **1** or more. If you find 1, you have it. Or do you want to know how many coincidences exist and what were the word that matched? – Jose Adrian Sep 15 '12 at 16:13

3 Answers3

1

This is actually not an easy problem, what you are trying to perform is called Full Text Search, and relational databases are not the best tools for such a task. If this is some kind of a core functionality consider using solutions dedicated for this kind of operations, like Sphinx Search Server.

If this is not a "Mission Critical" system, you can try with something else. I can see that datas.data column isn't really long, so you can create a structure dedicated for your task and keep maintaining it during operational use. Fore example, create table:

dictionary_datas (
    datas_id FK (datas.id),
    word FK (dictionary.word)
)

Now anytime you insert, delete or simply modify datas or dictionary tables you update dictionary_datas placing there info which datas_id contains which words (basically many to many relations). Of course it will degradate your performance, so if you have high high transactional load on your system, you have to do this periodicaly. For example place a Cron Job which runs every night at 03:00 am and actualize the table. To simplify the task you can add a flag TO_CHECK into DATAS table, and actualize data only for those records having there 1 (after you actualise dictionary_datas you switch the value to 0). Remember by the way to refresh whole DATAS table after an update to DICTIONARY table. 36 000 and 100 000 aren't big numbers in terms of data processing.

Once you have this table you can just query it like:

SELECT datas_id, count(*) AS words_num FROM dictionary_datas GROUP BY datas_id HAVING count(*) > 3;

To speed up the query (and yet slow down it's update) you can create a composite index on its columns datas_id, word (in EXACTLY that order). If you decide to refresh the data periodicaly you should remove the index before refresh, than refresh the data, and finaly create the index after refreshing - this way will be faster.

WojtusJ
  • 1,318
  • 1
  • 12
  • 19
  • Thanks, the idea of keeping a dedicated structure is good. The datas table contains data from a web service that is collected nightly and inserted with `load data`. I've managed to narrow down the query time to around 6 mins now which is fine (because it only needs to run once). – Drahcir Sep 15 '12 at 15:50
  • Narrow down with or without the structure I proposed? I suppose your data in DATAS table will grow, and thus it will be slowing down. If you implement any incremental approach (like updating only changed / inserted rows nightly) you will be able to more or less controll the processing time. – WojtusJ Sep 15 '12 at 16:14
  • I narrowed down by creating alias tables before doing the main query (thus reducing the number of records to match against first (the length > 3 was one of them)). The datas table is only ever used once, all data is fresh every day. – Drahcir Sep 15 '12 at 16:19
0

I'm not sure if I understood your problem, but I think this could be a solution. Also, I think people don't like Regular Expression but this works for me to select columns where their value has more than 1 word.

SELECT * FROM datas WHERE data REGEXP "([a-z] )+"

Jose Adrian
  • 1,217
  • 1
  • 17
  • 32
  • Thanks, but it doesn't solve the problem. The words must come from the dictionary table defined in the question, the table contains only certain key words that are relevant to the task. – Drahcir Sep 15 '12 at 15:38
0

Have you tried this?

 select *
 from dictionary, datas
 where position(word,data) > 0 
 ;

This is very inefficient, but might be good enough for you. Here is a fiddle.

For better performance, you could try placing a text search index on your text column DATA and then using the CONTAINS function instead of POSITION.

Turophile
  • 3,367
  • 1
  • 13
  • 21