How to efficiently select and group by a substring of a column

Question

I currently have two tables in MySQL InnoDB database. Here are two simplified versions of the tables.

Table 1 (PropData):

+-----------------------------------+
|  PropData                         |
+-----------+-----------------------+
|  BigTag   |  Date                 |
+-----------+-----------------------+
|  10001AB  |  1000-01-01 00:00:00  |
+-----------+-----------------------+

Table 2 (LimitTags):

+-------------------+
|  LimitTags        |
+---------+---------+
|  Tag    |  Model  |
+---------+---------+
|  10001  |  Base   |
+---------+---------+

Usage:

The PropData table contains a bunch of data that I need to save and plot, and each record in that table has a certain tag/model from the LimitTags table that applies to it (i.e. many PropData records can refer to one LimitTag record).

Goal:

I'm currently trying to create a query will retrieve a list of unique tag and model pairs by searching through the PropData table in some specified timespan and figure out which tag/model pairs (from the LimitTags table) exist.

Ideally I think this would be pretty doable by JOINing the Tag fields from both tables, including a WHERE clause to get the data within my timespan and then SELECTing the distinct tags and models.

The snag is that the Tag in the PropData table has two more chars - than the equivalent tag in the LimitTags table - appended to it (hence the name BigTag), which means I can't just JOIN the tables together.

What I've Tried:

SELECT DISTINCT S.Tag, S.Model
FROM (
    SELECT  T.Tag, 
            T.Model,
            P.Date
    FROM (
        SELECT LEFT(BigTag, 5) AS Tag, Date
        FROM PropData
        GROUP BY LEFT(BigTag, 5)
    ) AS P
    JOIN LimitTags T ON T.Tag=P.Tag
) AS S
WHERE S.Date BETWEEN '1000-01-01 00:00:00' AND '9999-12-31 23:59:59'
ORDER BY S.Date DESC;

The Issue:

The problem with the above query is that the first subquery that forms P is incredibly slow and searches the entire table, which is millions of records.

Running the explain command confirms this by telling me that the query is using filesort and that the type is ALL.

I'm basically stuck trying to figure out how to select all the records from PropData, where I only take the first 5 characters of the BigTag column (to easily join with LimitTags), and the Date (for only getting data between my timespan), and where I am only getting records in which the first 5 characters of the Tag are distinct (hence the group by).

I think the issue stems from how I'm using the Left() function in my select and group by statements (shown below), but I haven't figured out how to get around it.

SELECT LEFT(BigTag, 5) AS Tag, Date
FROM PropData
GROUP BY LEFT(BigTag, 5)

I also have an index on (BigTag) and (BigTag(5), Date) for PropData, but neither of those are used in the query for P. There is also an index on Tag for LimitTags and that is used during the join operation.

I'm pretty new to MySQL and writing queries in general and could definitely use some advice on how to accomplish this. This is also my first post on SO and I'm hoping I haven't screwed anything up!

Thanks in advance!

Edit 1:

From Strawberry's comment, I had the idea to create a new table that contains only two columns which are the id (primary key, not shown in the table structure above) and the shortened tag column with only the first 5 characters of the equivalent record's BigTag column. Then I can add a trigger to add or delete records in this table to match the original PropData table. I'll post back if that works well and there are no other solutions that work for just the existing tables.

Edit 2:

So I tried out what I said in the paragraph above, and I'm getting some really weird results. After creating a new table called PropDataTag with the columns being the Id and the first 5 chars of the BigTag from PropData. I tried performing this query:

SELECT DISTINCT T.Tag, T.Model
FROM PropData P
   JOIN PropDataTag N ON P.Id=N.Id
      JOIN LimitTags T ON N.Tag=T.Tag
WHERE P.Date BETWEEN '0000-01-01 00:00:00' AND '9999-12-31 23:59:59'

The first time I ran it, it went perfectly and I got my results in 0.016 sec. I also ran the explain command and got the following results. When I took the screenshot, I unfortunately neglected to expand the ref column so I don't know what two of the values are and am having trouble reproducing these results.

id  select_type  table  type    possible_keys                 key             key_len   ref     rows     Extra 
1   SIMPLE       T      index   IdxTag,IdxTagAndModel         IdxTagAndModel  49        NULL    1427     Using index; Using temporary
1   SIMPLE       N      ref     IdxTag                        IdxTag          7         NoIdea  1238     Using index; Distinct
1   SIMPLE       P      eq_ref  PRIMARY,IdxDate,IdxIdAndDate  IdxDate         38        NoIdea  1        Using where; Distinct

Thinking everything was well in hand, I then tried the exact same query on the other 3 PropData tables (mentioned in comments) and UNIONed them together. After running the query, it took more than 2 minutes before I stopped it for taking too long. I tried running the explain command on the original query shown above, and instead of getting the same results, I got the following:

id  select_type  table  type    possible_keys                 key         key_len   ref    rows     Extra 
1   SIMPLE       P      range   PRIMARY,IdxDate,IdxIdAndDate  IdxDate     24        NULL   1785585  Using where; Using Index; Using temporary
1   SIMPLE       N      eq_ref  PRIMARY,IdxTag                PRIMARY     38        P.Id   1
1   SIMPLE       T      ref     IdxTag,IdxTagAndModel         IdxTag      7         N.Tag  1

What just happened? I can't figure out:

Why did the order of the rows in the explain output change?
Why did MySQL decide to use different indices all of a sudden?

Does anyone have any ideas or input?

Can you try `SELECT Distinct T.Tag, T.Model, P.Date FROM PropData P JOIN LimitTags T ON T.Tag=LEFT(P.BigTag, 5) WHERE P.Date BETWEEN '1000-01-01 00:00:00' AND '9999-12-31 23:59:59' ORDER BY P.Date DESC;`? It moves the LEFT(BigTag, 5) into the join clause and eliminate all your subqueries, but should (hopefully) give the same results, faster. — Ghost, Jul 08 '16 at 23:18
@Ghost I spent a bit of time messing around with variations of your suggestion but it doesn't seem to work. I removed the P.Date from select because I don't actually need to return that and using distinct with the date is pointless since it would return every record. Just this: `SELECT SQL_NO_CACHE DISTINCT T.Tag, T.Model, P.Date FROM PropData P JOIN LimitTags T ON T.Tag=LEFT(P.BigTag, 5)` Takes 20+ sec and I think it's because of having Left() within the JOIN clause. I think I remember reading somewhere that MySQL can't index properly when you use a function in the JOIN clause — Aftab Narsimhan, Jul 09 '16 at 00:38
@Ghost Btw your suggestion still does perform way better than mine does. The example I have is pretty simplified and the reason I structured it this way is because there are actually 4 different PropData tables that are UNIONed together. My full query takes over 20 min while yours take under 1 which is still a solid improvement! I think the downside is that the complexity of this query seems to be about O(1N) for each PropData table I perform the join on. Ideally I'd like it to be constant time if I can figure out how to write this query in a way which uses indexes properly. — Aftab Narsimhan, Jul 09 '16 at 02:13
@Strawberry I had to look up what normalized meant, but I strongly agree that it would be the easiest way. Long story short, this is a fixed database and table structure that needs to support legacy applications, which means I can't change the structure. All of the PropData tables will likely have hundreds of millions of entries in a short period of time and it doesn't seem logical for me to create a new table with the same data except normalized so that I could split the BigTag column into two columns with the first 5 chars in one and the last two in another for example — Aftab Narsimhan, Jul 10 '16 at 18:58
The legacy stuff could access a denormalised view of the new, normalised schema — Strawberry, Jul 10 '16 at 19:16
@Strawberry I understand that and despite the fact that that is simple to implement, right now it just isn't an option since those legacy applications are fixed and deployed. Unless there is literally no other viable option, there's no way I could tell my manager that I suggest digging up the source code for several older applications, implementing the changes you're suggesting, and then retesting these applications to make sure nothing is broken. But see the edits in the OP for what I will try on Monday. — Aftab Narsimhan, Jul 10 '16 at 20:10
I don't know about other solutions, but maybe others can help. — Strawberry, Jul 10 '16 at 20:17