1

Found Problem:

I was able to significantly reduce the response time to less than 0.01 seconds from 1.1 seconds simply by removing the ORDER BY clause.

I was doing a USORT in PHP on dates anyway since the post-processing would reorder the posts by type of match, so the ORDER BY clause was unnecessary.

The query described by Larry Lustig is very fast for integers. It was the ordering by a string dates which caused significant performance problems.

Hope this servers as a good example of how to apply conditions to multiple rows and why you want to watch out for database operations involving strings.

Original Question (Edited for Clarity):

I'm working on an indexed search with keyword stemming. I'm looking to optimize it more.

Larry Lustig's answer helped me a lot here on how to create the query for easy scaling:

SQL for applying conditions to multiple rows in a join

Any suggestions of how to optimize this example query to run faster? I am guessing there could be a way to do one join for all the "s{x).post_id = p.ID" conditions.

SELECT p.ID, p.post_title, LEFT ( p.post_content, 800 ), p.post_date 
FROM wp_posts AS p
WHERE p.post_status = 'publish' AND ( ( ( 
    ( EXISTS ( SELECT s0.post_id FROM wp_rama_search_index AS s0 
        WHERE s0.post_id = p.ID AND s0.hash = -617801035 ) )
) OR ( 
    ( EXISTS ( SELECT s1.post_id FROM wp_rama_search_index AS s1 
            WHERE s1.post_id = p.ID AND s1.hash = 1805184399 ) 
        AND EXISTS ( SELECT s2.post_id FROM wp_rama_search_index AS s2 
            WHERE s2.post_id = p.ID AND s2.hash = 1823159221 ) 
        AND EXISTS ( SELECT s3.post_id FROM wp_rama_search_index AS s3 
            WHERE s3.post_id = p.ID AND s3.hash = 1692658528 ) ) 
) OR ( 
    ( EXISTS ( SELECT s4.post_id FROM wp_rama_search_index AS s4 
            WHERE s4.post_id = p.ID AND s4.hash = 332583789 ) ) 
) OR ( 
    ( EXISTS ( SELECT s5.post_id FROM wp_rama_search_index AS s5 
            WHERE s5.post_id = p.ID AND s5.hash = 623525713 ) ) 
) OR ( 
    ( EXISTS ( SELECT s6.post_id FROM wp_rama_search_index AS s6 
            WHERE s6.post_id = p.ID AND s6.hash = -2064050708 ) 
        AND EXISTS ( SELECT s7.post_id FROM wp_rama_search_index AS s7 
            WHERE s7.post_id = p.ID AND s7.hash = 1692658528 ) ) 
) OR ( 
    ( EXISTS ( SELECT s8.post_id FROM wp_rama_search_index AS s8 
            WHERE s8.post_id = p.ID AND s8.hash = 263456517 ) 
        AND EXISTS ( SELECT s9.post_id FROM wp_rama_search_index AS s9 
            WHERE s9.post_id = p.ID AND s9.hash = -1214274178 ) ) 
) OR ( 
    ( EXISTS ( SELECT s10.post_id FROM wp_rama_search_index AS s10 
            WHERE s10.post_id = p.ID AND s10.hash = -2064050708 ) 
        AND EXISTS ( SELECT s11.post_id FROM wp_rama_search_index AS s11 
            WHERE s11.post_id = p.ID AND s11.hash = -1864773421 ) ) 
) OR ( 
    ( EXISTS ( SELECT s12.post_id FROM wp_rama_search_index AS s12 
            WHERE s12.post_id = p.ID AND s12.hash = -1227797236 ) ) 
) OR ( 
    ( EXISTS ( SELECT s13.post_id FROM wp_rama_search_index AS s13 
            WHERE s13.post_id = p.ID AND s13.hash = 1823159221 ) 
        AND EXISTS ( SELECT s14.post_id FROM wp_rama_search_index AS s14 
            WHERE s14.post_id = p.ID AND s14.hash = -1214274178 ) ) 
) OR ( 
    ( EXISTS ( SELECT s15.post_id FROM wp_rama_search_index AS s15 
            WHERE s15.post_id = p.ID AND s15.hash = 323592937 ) ) 
) OR ( 
    ( EXISTS ( SELECT s16.post_id FROM wp_rama_search_index AS s16 
            WHERE s16.post_id = p.ID AND s16.hash = 322413837 ) ) 
) OR ( 
    ( EXISTS ( SELECT s17.post_id FROM wp_rama_search_index AS s17 
            WHERE s17.post_id = p.ID AND s17.hash = 472301092 ) ) ) ) ) 
ORDER BY p.post_date DESC

This query runs in about 1.1s from phpMyAdmin connecting to a large AWS Aurora database instance.

There are 35k published posts. Words in post titles and excerpts of post content are inflected and hashed using FVN1A32.

The "AND" operations represent phrases where both hashes must be matched. There are several hashes because keyword stemming is being used and where aliases of keywords can also be phrases.

O. Jones
  • 103,626
  • 17
  • 118
  • 172

2 Answers2

0

I think this would do the trick for you. It is basically a question of Relational Division, with multiple possible divisors.

I must say, I'm not entirely familiar with MySQL so I may have some slight syntax errors. But I'm sure you will get the idea.

Put the different search conditions in a temporary table, each group having a group number.

We select all rows in our main table, where our temp table has a group for which the total number of hashes in that group is the same as the number of matches on those hashes. In other words, every requested hash in the group matches.

CREATE TABLE #temp (grp int not null, hash int not null, primary key (grp, hash));

SELECT p.ID, p.post_title, LEFT ( p.post_content, 800 ), p.post_date 
FROM wp_posts AS p
WHERE p.post_status = 'publish' AND 
    EXISTS ( SELECT 1
        FROM #temp t
        LEFT JOIN wp_rama_search_index AS s0
            ON s0.hash = t.hash AND s0.post_id = p.ID
        GROUP BY grp
        HAVING COUNT(*) = COUNT(s0.hash)
    );

There are other ways to slice this, for example the following, which may be more efficient:

SELECT p.ID, p.post_title, LEFT ( p.post_content, 800 ), p.post_date 
FROM wp_posts AS p
CROSS JOIN #temp t
LEFT JOIN wp_rama_search_index AS s0
    ON s0.hash = t.hash AND s0.post_id = p.ID
WHERE p.post_status = 'publish'
GROUP BY p.ID    -- functional dependency??
HAVING COUNT(*) = COUNT(s0.hash);

See also these excellent articles on SimpleTalk: Divided We Stand: The SQL of Relational Division and High Performance Relational Division in SQL Server. The basic principles should be the same in MySQL.

I would put an compound index on the temp table if you can, not sure which order.

Charlieface
  • 52,284
  • 6
  • 19
  • 43
  • dificult to follow. is there an example combining two hashes with an "AND" operation, and combining it with an "OR" operation on a single hash? – Keith Gardner Jan 21 '21 at 04:44
  • You need to think logically about what it is you trying to do. What you are saying is that you have multiple lists of hashes, you need a post to be able to **fully** match at least a single list of hashes. So we need to store the matching hashes in a temp table, with a grouping number to show which list they are part of. Then we do relational division to work out if any of the lists match up with any of the posts. – Charlieface Jan 21 '21 at 08:58
  • code for phrases and single words will differ in the sql, but doesn't matter to preg match for validation in the post-processing. i don't want to preg match 50k posts in the database, so i check 32-bit integers and reduce it down to 5k or less results. i think doing a union of a simple join of single words and with the query above for only phrases (match on multiple rows of a column) will optimize the query. i was curious if anyone knew the fastest scalable way. – Keith Gardner Jan 22 '21 at 02:02
  • Do you understand now how these two queries work and how they answer this? The second article I quote goes into detail about performance. – Charlieface Jan 22 '21 at 02:07
  • i have difficulty following aggregate logic and is why i was asking for a more detailed example where you are both returning posts that match on single hashes and match on multiple hashes that have "AND" logic. not sure about the cross join either. they are scalar and can be dangerous. not sure what the temp table does for me except to narrow down results to only possible matches before applying the "AND" logic of the phrases, which isn't a bad idea before increasing algorithm compexity. – Keith Gardner Jan 22 '21 at 04:03
  • as far as a temp table, wouldn't it be best just to have that has an inner select query? writing to a table might be slower than retaining a bunch of integers in memory. – Keith Gardner Jan 22 '21 at 04:06
  • i was able to answer my own question. the order by clause on string dates was causing the performance problems. the query without the order by worked extremely fast. – Keith Gardner Jan 22 '21 at 04:45
  • Hard to say whether it's better or worse, would depend on the query plan. Obviously you're right about using single variables, but that means every query gets differing amounts of variables, and you have to recompile every time. The benefit of the temp table is it's the same query every run. I'm unsure the performance of temp tables in MySQL. In SQL Server you can bulk load them and they don't always get stored on disk, so they're quite fast – Charlieface Jan 22 '21 at 10:27
0
SELECT s15.post_id FROM wp_rama_search_index AS s15 
        WHERE s15.post_id = p.ID AND s15.hash = 323592937

What is the real name of s15? That may give us some clues of how to improve the query. Also, what is hash about?

Change to simply SELECT 1 FROM ...; there is not an advantage (and maybe a disadvantage) in listing any columns when doing EXISTS).

This composite index, with the columns in either order, should help:

INDEX(post_id, hash)

OR is deadly to optimization. However, in your query, it may be advantageous to sort the EXISTS(...) clauses with the most likely ones first.

This is WordPress? It may be that collecting the attributes that you are searching among into a single column (in wp_rama_search_index?), applying a FULLTEXT index to the column, then using MATCH(ft_column) AGAINST ("foo bar baz ..."). This may do most of the ORs without needing hashes and lots of EXISTS. (Again, I am guessing what is going on with s15andhash`.)

Your preprocessing of the words would still be useful for synonyms, but unnecessary for inflections (since FULLTEXT handles that mostly). Phrases can be handled with quotes. That may obviate your current use of AND.

If page-load takes 1.5 seconds aside from the SELECT, I suspect there are other things that need optimizing.

danblock's suggestion about replacing OR with IN works like this:

    EXISTS ( SELECT s15.post_id FROM wp_rama_search_index AS s15 
        WHERE s15.post_id = p.ID AND s15.hash = 323592937 
           ) 
OR 
    EXISTS ( SELECT s16.post_id FROM wp_rama_search_index AS s16 
        WHERE s16.post_id = p.ID AND s16.hash = 322413837 
           ) 

-->

OR EXISTS( SELECT 1 FROM wp_rama_search_index
        WHERE post_id = p.ID
          AND hash IN ( 323592937, 322413837 )

(LIMIT 1 is not needed.) My suggestion of FULLTEXT supersedes this.

Using FULLTEXT will probably eliminate the speed variation due to having (or not) the ORDER BY.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Thank you for the examples. The performance problem was in the order by. "hash" represent fnv321a hashes of words in post titles and excerpts of post content. query now runs in less than 0.01s so replacing the single word matches with an "in" statement was not needed. there are several exists statements because keyword stemming is being used to match aliases of search terms. keywords and aliases may also be phrases. – Keith Gardner Jan 26 '21 at 05:09