2

Ive been tasked with linking the ids of two different API's, the linking will be done based on names, therefore the searches use wildcard are a bit slow.

For example- One api uses the name Lionel Messi, while the other uses Lionel Andrés Messi. To solve this queries are done by doing

select id from players WHERE name LIKE '%Lionel%Messi%'

This proves effective but slow with queries taking an average of .3 seconds, and with 100k searches necessary this will take all day.

Since the slow bit is the query, would it be possible for my php program to be multi threaded so that multiple queries could run at the same time.

Would it be as simple as splitting the list of 100k searches into 4 lists of 25k, and just running the script in 4 different web pages?

EDIT-BTW the column "name" is an index in the table "players" however that seems to have little to no impact on speed

Marc-9
  • 145
  • 1
  • 9

1 Answers1

1

Yes, it sounds like this can be done multi-threaded, as each operation (linking a single pair of IDs) doesn't depend on the results of previous operations. To get the best performance, you would split the input (the table) into as many lists as you have processor cores. The split could be done multiple ways depending on your requirements, e.g. ID ranges, splitting into several different tables, etc. And yes, running the script in multiple browser windows should create the desired parallelisation, making use of all available CPU cores. It may depend on how your server (Apache, nginx, etc) is configured, but I think most servers in their default configuration will get this right.

To elaborate on why the index doesn't have any effect -- an index is just a data structure that allows you to kind of reverse the way the basic function of selecting a row works, in order to find rows where a column matches a particular value. So instead of the input being a row number (not an id but an actual offset into the data that locates the row in physical storage) and the output being a row, the input is a column value (e.g. a numeric ID or a string) and the output is a list of row numbers that match that value. Various data structures are used, but the mechanism depends on the actual value (e.g. the ID) being stored on disk in a data structure. So the reason that wildcards aren't indexed is that every possible wildcard matching each unique value would have to be stored on disk.

Edit as detailed in the answers linked in a comment (Mysql Improve Search Performance with wildcards (%%)), MySQL can use indexes with wildcards as long as the string doesn't start with a wildcard -- presumably because rows can be eliminated immediately based on the start of the string.

Gus Hogg-Blake
  • 2,393
  • 2
  • 21
  • 31
  • With these improvements the time for my script to execute went from 9 hours to 2 minutes. While the multi threading was awesome, removing the wildcard from the beginning of the string was the difference maker. Thanks a ton – Marc-9 Jun 14 '21 at 00:34
  • Interesting! I guess the index can be used as long as the string doesn't start with a wildcard. – Gus Hogg-Blake Jun 14 '21 at 09:39