
I have built a small system that gathers and analyses surveys, and I need to find a way to optimise its performance. I noticed that many of the queries have parts in common and are executed many, many times. In the example below, the `survey_q_id` = '180002' part is the one that changes.

SELECT COUNT(`result_q`.`id`) as result_q_c 
FROM result_q
LEFT JOIN `survey_save` ON `survey_save`.`id` = `result_q`.`ss_id` 
WHERE `result_q`.`survey_q_id` = '180002' 
AND `survey_save`.`survey_order_id` IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) 
AND `survey_save`.`id` IN (7,9,24,26,29,30,31,33,34,38,39,41,45,46,47,50,51,52,53,64,65,68,74,76,79,81,82,87,92,93,96,115,116,117,131,148,149,150,151,153,155,156,159,160,162,165,166,168,176,179,182,186,188,190,194,201,204,208,210,211,216,221,226,235,236,232,250,266,279,280,283,287,290,298,299,304,307,308,315,317,318,330,342,356,357,358,360,366,367,370,373,374,379,380,383,390,391,394,397,402,404,405,406,408,413,415,436,446,450,458,465,467,468,469,471,473,477,479,480,481,485,489,493,494,500,504,518,521,532,536,539,542,544,568,570,574,576,579,582,584,585,586,594,595,598,621,622,629,651,652,653,657,658,665,669,674,675,684,690,696,700,708,712,715,721,722,723,724,739,740,749,756,757,761,764,765,767,770,775,776,780,787,788,792,800,805,809,813,815,818,821,824,829,830,831,832,841,843,865,866,867,868,869,870,871,872,873,874,875,876,877,878,879,880,881,882,883,884,885,886,887,888,889,890,891,892,893,894,895,896,982,983,986,988,991,993,996,998,1001,1003,1006,1009,1011,1013,1019,1021,1024,1027,1032,1061) 
AND `result_q_radio`.`date_save` > '2019-01-01 00:00:00'
AND `result_q_radio`.`date_save` < '2022-05-16 12:19:44'

So the data set is first limited, and then I need to count how many rows in that partial set have one of the columns set to an id of x. The survey_order_id column and all id columns are indexed. The data set is not static: the result_q table grows a lot daily, but by the time analysis is performed, the analysed range of result_q entries is read-only, so no new elements are added to it. In other words, survey X, filled in by Y people, generated many entries in the result_q table, but once the survey is closed those entries don't change; new entries keep being added, but the set belonging to that one survey stays fixed.

The amount of data collected grows a lot daily, and I need a way to tell MySQL to work only on these rows for each request. The data will be analysed in many ways, so the subsets will not be static, but within a single page call a big part of the queries will operate on the same specific portion of the data. I assume some caching mechanism may already be optimising this in the background, but I would appreciate any ideas for improvement.

Maybe some additional calls, or preparing the data set in some way beforehand, or some sort of manual caching that would temporarily gather the data to be worked on, so that only that portion is analysed?
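For example, something along these lines (a rough sketch only; the temporary table name is my own, and I'm assuming the session stays open for the whole page call and that the date column lives where I put it):

```sql
-- Sketch: materialise the fixed per-page subset once, then run each
-- repeated per-question COUNT against the small temporary table
-- instead of re-filtering the big tables on every query.
CREATE TEMPORARY TABLE tmp_subset AS
SELECT rq.id, rq.survey_q_id
FROM result_q AS rq
JOIN survey_save AS ss ON ss.id = rq.ss_id
WHERE ss.survey_order_id IN (1,2,3,4,5,6,7,8,9,10)
  AND ss.id IN (7, 9, 24 /* , ... the full id list ... */)
  AND rq.date_save > '2019-01-01 00:00:00'
  AND rq.date_save < '2022-05-16 12:19:44';

-- Each repeated query then varies only in the survey_q_id value:
SELECT COUNT(*) AS result_q_c
FROM tmp_subset
WHERE survey_q_id = '180002';
```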

Thanks!

PS. Currently working on: PHP 7.4 + MariaDB 10.3.27

  • Nothing in your query is jumping out to me as anything that should be slow. I would think that having indexes on the columns you are filtering on would be the number one thing to speed it up. – Chris Haas May 16 '22 at 11:52
  • If the query in your question is repeated many times in the same script and only `survey_q_id = '180002'` varies, then a [prepared statement](https://www.php.net/manual/en/pdo.prepare.php) will indeed help. You can call the prepared statement many times with only the `survey_q_id` changing. The performance gain will however not be spectacular. Having proper indexes is king here. – KIKO Software May 16 '22 at 11:55
  • I’d also add that it is entirely possible that your queries are blazingly fast, but you have some loops in PHP that are slowing things down. So I’d recommend doing some [performance testing](https://stackoverflow.com/a/1200282/231316), too. – Chris Haas May 16 '22 at 12:01
  • Could you elaborate a bit on the surrounding code? What are you doing in PHP to process the data? – Christoffer May 16 '22 at 12:12
  • Welcome to Stack Overflow! Generally the way to handle subsets of data efficiently is this: use appropriate indexes on the tables. To help you with your [tag:query-optimization] question, we need to see your table definitions, your index definitions, and the output of EXPLAIN. Please [read this](https://stackoverflow.com/tags/query-optimization/info), then [edit] your question. – O. Jones May 16 '22 at 13:24
  • The query is malformed. It seems you want an outer join but the engine is silently converting it to an inner join due to the predicate `survey_save.survey_order_id IN (...)`. Please fix the query first; then, we can help optimizing it. – The Impaler May 16 '22 at 14:50
  • `result_q_radio` seems to be missing from FROM or JOIN. Please fix the query. Also, please provide `SHOW CREATE TABLE` for each table. – Rick James May 20 '22 at 19:09
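To illustrate KIKO Software's suggestion, a prepared statement can be sketched directly in MariaDB SQL (the PHP PDO version binds the parameter the same way; this is only an illustration of the idea, with the join simplified):

```sql
-- Sketch of the prepared-statement idea: parse the query once, then
-- execute it repeatedly with only the survey_q_id parameter changing.
PREPARE count_stmt FROM
  'SELECT COUNT(rq.id) AS result_q_c
   FROM result_q AS rq
   JOIN survey_save AS ss ON ss.id = rq.ss_id
   WHERE rq.survey_q_id = ?
     AND ss.survey_order_id IN (1,2,3,4,5,6,7,8,9,10)';

SET @q = '180002';
EXECUTE count_stmt USING @q;   -- repeat with other ids as needed

DEALLOCATE PREPARE count_stmt;
```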

1 Answer


Thanks everyone! @Chris Haas: while all the INT columns used in this query were indexed, there were other queries against the result_q_radio table whose INT columns were not indexed. Out of desperation I implemented a caching table that short-circuited queries that had already been performed; repeated queries became ~5x faster, but the problem remained whenever a user clicked to filter results, which forced recalculating yet another combination of survey results. Since the data is effectively read-only once survey-gathering is finished for a diagnosis, I planned a queue system to cache the most common combinations, with a CRON job preparing the cache during night hours so as not to put a big load on performance. I also profiled my PHP itself; while I can't say it is perfect, it did not appear to be the kind of issue that would slow things down exponentially as the number of records in the database grew.
I also, somewhat desperately, asked the client and the hosting company whether switching from shared hosting to a dedicated server would improve performance enough to at least buy us time to develop optimisations.
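The caching table looked roughly like this (a simplified sketch; the table name, key columns, and hashing scheme here are assumptions, not the real schema):

```sql
-- Sketch of the manual caching idea: store each already-computed
-- count keyed by its parameter combination, so repeated page calls
-- read one row instead of re-running the aggregate.
CREATE TABLE IF NOT EXISTS result_count_cache (
  survey_q_id  INT NOT NULL,
  filter_hash  CHAR(32) NOT NULL,   -- e.g. MD5 of the id-list filters
  result_q_c   INT NOT NULL,
  computed_at  DATETIME NOT NULL,
  PRIMARY KEY (survey_q_id, filter_hash)
);

-- The nightly CRON job recomputes and upserts the usual combinations:
SET @c = (SELECT COUNT(*) FROM result_q WHERE survey_q_id = 180002);
INSERT INTO result_count_cache (survey_q_id, filter_hash, result_q_c, computed_at)
VALUES (180002, MD5('filters-placeholder'), @c, NOW())
ON DUPLICATE KEY UPDATE result_q_c = VALUES(result_q_c),
                        computed_at = VALUES(computed_at);
```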

BUT after a lot of stress I opened phpMyAdmin, looked at the result table, and noticed it had some INT columns that are used elsewhere in queries; with a few clicks I indexed them. The speed improvement was astronomical! So in the end the solution was to index ALL the INT columns, in ALL tables, that are used in WHERE clauses. Chris Haas, thank you: your first reply, although I dismissed it at first (hey, I have them indexed already!), turned out to be the most accurate. I had only checked the tables and queries involved in the one query I suspected of being slowest. The same solution was suggested by O. Jones, thank you!
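For anyone landing here, the fix itself was just adding the missing indexes, along these lines (column names are examples from my schema; check with EXPLAIN that they are actually used):

```sql
-- Sketch: add indexes on the INT/date columns used in WHERE clauses.
ALTER TABLE result_q_radio
  ADD INDEX idx_survey_q_id (survey_q_id),
  ADD INDEX idx_date_save (date_save);

-- EXPLAIN should then show ref/range access instead of a full scan:
EXPLAIN SELECT COUNT(*) FROM result_q_radio WHERE survey_q_id = 180002;
```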

Thanks Christoffer, The Impaler, KIKO Software, O. Jones and Rick James too!

1. Prepared statements did not sound worth the effort, judging from KIKO's description.
2. I did not consider the silent conversion to an inner join a likely source of the problem, since I assumed this would be cached internally by SQL somehow, but in future I will keep it in mind as an additional optimisation method.
3. Rick James spotted the typos: result_q should be result_q_radio everywhere in the code. I had only shortened it to result_q at the beginning and forgot about the remaining occurrences. Each question type has a separate table (radios, checkboxes, matrix radios, etc.), and I didn't want to draw attention to why it is "radio" in the question, as all the tables work similarly.

Thank you all and have a great day!