SQL Many to Many Query Find All Multiple Matches Duplicate Issue

Question

I've been using a JPQL query like this:

SELECT s 
FROM Sentence s
INNER JOIN s.words sw
WHERE s.date = :date
AND sw IN (:words)
GROUP BY s
HAVING count(sw) = :numberOfWords

In SQL s.words would be the linking table in the many to many :word is the collection of words we are looking to match :numberOfWords is the number of words in the collection

For those not familiar with JPQL, the similar SQL is found here: SQL query to find matches for multiple criteria

I'm running into a problem and I don't know if it's a design or a query problem.

So I have sentences (s) and words (w) and the sentence_word (sw) table joins the two. For example sentence "The lead dog of the group is blue" and word collection of [ the , cat ], this query will select this sentence as a match. It's a false positive because it contains "the" twice and does not contain "cat".

In SQL I think it comes out as:

SELECT s.id
FROM sentence s
JOIN sentence_word sw ON ( sw.s_id = s.id )
JOIN word w ON ( w.id = sw.w_id )
WHERE w.word IN ( 'the', 'cat' )
GROUP BY s.id
HAVING COUNT(1) = 2

The easiest solution I can think of is to just disallow duplicates in the linking table so instead of two links for the in this sentence, there is just one (thus the linking table stores that the sentence contains the word the, just not how many times). However, I think there will be times where I would want to search for a sentence that contains "the" many times so I would prefer not to use this option.

Help please!

Resources

Some good examples of similar searches and performance can be found here (thanks to the answerer for the link): How to filter SQL results in a has-many-through relation

The "IN" clause will match as long as _any_ of the words in the list are found. You need to find a way to make sure _all_ of the words are found. — Mike Parkhill, Oct 10 '12 at 02:02

score 1 · Accepted Answer · answered Oct 10 '12 at 07:27

1

Change the HAVING clause to count distinct words, with eIther COUNT(DISTINCT sw.w_id) = 2 or COUNT(DISTINCT w.word) = 2:

SELECT s.id
FROM sentence s
JOIN sentence_word sw ON ( sw.s_id = s.id )
JOIN word w ON ( w.id = sw.w_id )
WHERE w.word IN ( 'the', 'cat' )
GROUP BY s.id
HAVING COUNT(DISTINCT sw.w_id) = 2 ;

answered Oct 10 '12 at 07:27

ypercubeᵀᴹ

113,259
19
174
235

Upon quick testing, this seems to work as desired in SQL, thanks! I do generally try to avoid distinct, and didn't even think to have it in having. I'm going to keep testing on bigger data sets and then accept. Performance seems to be pretty decent too. – bladmiral Oct 10 '12 at 10:50
Regarding performance, you should examine the execution (EXPLAIN) plan and the indexes you have. You can also try other rewritings of the query. See this question: [How to filter SQL results in a has-many-through relation](http://stackoverflow.com/questions/7364969/how-to-filter-sql-results-in-a-has-many-through-relation) which has more than 10 versions for the same problem (in your case, they'd have to be changed slightly because you don't have a unique constraint on the `(s_id, w_id)` combination). – ypercubeᵀᴹ Oct 10 '12 at 11:01
The Joins usually perform better but require dymanic SQL. The "general" solutions (like yours) are easier to write and can work with a variable parameter list which can be a table. – ypercubeᵀᴹ Oct 10 '12 at 11:03
Thanks, yes, I saw that other link and it is great. I was really looking for a way to avoid writing loads of subqueries and adding distinct here works great. I'm probably will to sacrifice some sql performance in exchange for not having a bunch of loops creating subqueries. Thanks so much for the help! – bladmiral Oct 10 '12 at 11:08

SQL Many to Many Query Find All Multiple Matches Duplicate Issue

1 Answers1