
This is kind of a follow-up to my previous question (link).

Some time has passed, so I wasn't sure whether to update that one or create a new one, especially since the parameters of my issue have changed a little.

--

I have a large (simple) table of phone numbers (>5 million and growing). The table was set up as follows:

| AreaCode | Local | Created |
  • 'Local' is the 7 digits following the 3 digit area code
  • 'Created' is just a timestamp
  • AreaCode and Local are both indexed
  • I used to have an ID column as my primary key, but removed it when I partitioned the table. I set up 10 partitions by range on AreaCode (<100, <200, <300, …, <999)
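
For concreteness, the partitioned table described above would look something like the following sketch (the DDL is a reconstruction from the description, not the original; the column types are assumptions):

CREATE TABLE master_table (
    AreaCode SMALLINT UNSIGNED NOT NULL,   -- 3-digit area code
    Local    INT UNSIGNED NOT NULL,        -- 7 digits following the area code
    Created  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    KEY idx_areacode (AreaCode),
    KEY idx_local (Local)
)
PARTITION BY RANGE (AreaCode) (
    PARTITION p0 VALUES LESS THAN (100),
    PARTITION p1 VALUES LESS THAN (200),
    PARTITION p2 VALUES LESS THAN (300),
    PARTITION p3 VALUES LESS THAN (400),
    PARTITION p4 VALUES LESS THAN (500),
    PARTITION p5 VALUES LESS THAN (600),
    PARTITION p6 VALUES LESS THAN (700),
    PARTITION p7 VALUES LESS THAN (800),
    PARTITION p8 VALUES LESS THAN (900),
    PARTITION p9 VALUES LESS THAN MAXVALUE  -- catches 900-999
);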

I am uploading CSV files of numbers (up to 250k rows), sanitizing and cleaning the input via PHP, then creating a temp table and inserting the data into it.
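
To make the temp-table step concrete, it might look something like this (a sketch only; the column definitions mirror the master table above, and the sample values are made up):

CREATE TEMPORARY TABLE temp_table (
    AreaCode SMALLINT UNSIGNED NOT NULL,
    Local    INT UNSIGNED NOT NULL,
    KEY idx_area_local (AreaCode, Local)  -- indexing the temp table matters; see the comments below
);

-- PHP batches the sanitized rows into multi-row inserts, e.g.:
INSERT INTO temp_table (AreaCode, Local) VALUES
    (212, 5551234),
    (415, 5559876);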

After this point, I am having a lot of trouble. With smaller data sizes (10k-25k rows), I don't really have any problems. But when I'm trying to compare my temp table containing 250k+ rows against my master database, it takes incredibly long.

I've tried the following two queries, without much luck with either of them.

Using an inner join:

SELECT a.* FROM master_table a
INNER JOIN temp_table b
ON a.AreaCode = b.AreaCode
AND a.Local = b.Local;

I found this suggestion on a website and tried it:

SELECT b.* FROM temp_table b
WHERE b.AreaCode
IN (
    SELECT a.AreaCode
    FROM master_table a
    WHERE a.AreaCode = b.AreaCode
    AND a.Local = b.Local
);
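
For what it's worth, that second query is a correlated semi-join; a semantically equivalent (and usually clearer) way to express "rows in temp_table that already exist in master_table" is with EXISTS:

SELECT b.* FROM temp_table b
WHERE EXISTS (
    SELECT 1
    FROM master_table a
    WHERE a.AreaCode = b.AreaCode
    AND a.Local = b.Local
);

Either form still needs suitable indexes to avoid a scan per outer row (see the comments below).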

I apologize for the long list of questions, but my mastery of MySQL is weak.

  1. Am I making a mistake by not having a primary and/or unique key? Since every phone number is unique, I wasn't sure that having an ID column would benefit me.
  2. Am I right to partition my master table or is this slowing me down?
  3. I have indexes on my AreaCode and Local columns. When I create temporary tables, should I be creating indexes on the same columns there as well?
  4. Please help me fix my query so it doesn't take so long!!
ahnkee
  • Did you look at the query plan output to see if both of your queries are doing table scans? That would be the first thing I'd look at. Also, I would put an index on the temp tables because if you don't, you're effectively forcing a table scan during the subqueries. – Timo Geusch Jun 14 '13 at 22:59
  • Do you have separate indexes on each column, or a composite index on `(AreaCode, Local)`? – Barmar Jun 14 '13 at 23:12
  • @TimoGeusch Would that be using Explain? I did pull it yesterday and I am trying to make sense of it right now, tbh. And ty for the clarification regarding indexes on the temp table. I have been doing it but wasn't sure if I should or not. – ahnkee Jun 15 '13 at 04:10
  • @Barmar I am using separate indexes. Is that not the way to go? – ahnkee Jun 15 '13 at 04:11
  • Only one index can be used when optimizing a join. If you have separate indexes, it will pick one of them, and then have to do a scan to match the other column. If you use a composite index it can use it to match both columns at once. – Barmar Jun 15 '13 at 05:07
  • Ah I see. Thank you. I will run my script after setting up a composite index as you suggest and hopefully that will have a positive impact on performance. – ahnkee Jun 17 '13 at 08:00
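
The composite index Barmar describes would be created along these lines (the index name is arbitrary), and EXPLAIN, as Timo Geusch suggests, is how to check whether the join is still doing full scans:

-- One index covering both join columns, on the master table and the temp table
ALTER TABLE master_table ADD INDEX idx_area_local (AreaCode, Local);
ALTER TABLE temp_table ADD INDEX idx_area_local (AreaCode, Local);

-- In the plan output, type = ALL on a table means a full scan
EXPLAIN SELECT a.* FROM master_table a
INNER JOIN temp_table b
ON a.AreaCode = b.AreaCode
AND a.Local = b.Local;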

1 Answer


To answer your questions:

  1. I do not see how you are enforcing uniqueness with your current indexes. You could have a compound primary key on AreaCode and Local to enforce this uniqueness, and I would have some sort of primary key for sure. I would question, though, whether you ever need to query Local without AreaCode, to determine whether you need individual indexes on each column. Personally, I would probably use an auto-increment primary key plus a compound unique index on those two fields if I were going to be referencing this data from other tables (say, relating phone numbers to users), as I find it less cumbersome to relate tables through a single key. A sketch of such a schema follows the query below.

  2. 5 million rows is not that big of a table, so it is probably premature to partition. Also, depending on the ratio of different area codes in your database, and the access pattern for them, this may not be a good partitioning scheme.

  3. If you are going to work with on-disk temporary tables and join them against large data sets like this, you need to provide indexes on them.

  4. You have two different queries that do two different things here. If the intent is ultimately to insert this data into the master table, I don't understand why you are trying to do a join at all. You could simply do something along the lines of:


 INSERT INTO master_table (`areaCode`, `local`)
 SELECT `areaCode`, `local`
 FROM temp_table
 ON DUPLICATE KEY UPDATE `created` = NOW(); /* You can add this line if you want to update the timestamp */
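
To illustrate point 1 (and what makes the ON DUPLICATE KEY UPDATE above actually fire), a schema along those lines could look like the following sketch. The column types are assumptions, and it presumes the partitioning is dropped per point 2, since MySQL requires every unique key on a partitioned table to include the partitioning column:

 CREATE TABLE master_table (
     id       INT UNSIGNED NOT NULL AUTO_INCREMENT,  -- single surrogate key for relating to other tables
     AreaCode SMALLINT UNSIGNED NOT NULL,
     Local    INT UNSIGNED NOT NULL,
     Created  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
     PRIMARY KEY (id),
     UNIQUE KEY uq_area_local (AreaCode, Local)  -- one row per phone number; also serves the join
 );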
Mike Brant
  • You're right, I'm really not. My understanding of DB structures and all is pretty weak, so thank you for your detailed explanation. Would it be more efficient to have a compound primary key on the two columns, or should I just create a 3rd column with the full phone number and use that as a primary key? Since I should only have one of each number, would that be an acceptable approach? At what point would you recommend using partitions? I was reading that partitions can really help if used right, but also hurt a lot if utilized incorrectly (as I suspect I'm doing). – ahnkee Jun 15 '13 at 04:15
  • Basically, what I'm doing here is the first step of my process. I'm trying to find the duplicates first so that I can shoot out a csv file with those numbers. Then I need to produce another file with the unique numbers. I do need to insert new data, but also need to retain duplicate info. So what I'm thinking is 1) compare temp vs master tables, find duplicates, store those values into an array. 2) run a query similar to #1 to find and store unique values. 3) insert new values into master table. – ahnkee Jun 15 '13 at 04:20
  • @ahnkee One other possible approach would be to insert items directly into the table as I have proposed, but insert them with a flag that you can use for post-processing which records were newly added (a sketch of this follows the thread below). As you process these records you can set the flag back to its default state. – Mike Brant Jun 17 '13 at 15:11
  • Ah, I see. I see the merit in your suggestion and how it'll help me eliminate a step, even, and make my script run quicker. This might be a low level question but a flag is an additional column that I would create, correct? – ahnkee Jun 17 '13 at 17:23
  • Awesome. Thanks for your help! Will be implementing these changes today. Will update you on how it all goes! – ahnkee Jun 17 '13 at 18:10
  • If I am inserting a list of 250k rows into a database of 5 million, how long should it take, do you think? I feel like it's taking longer than it should. I'm running a test of about 250k rows vs 1 million and it's taking like 2-3 min. – ahnkee Jun 17 '13 at 21:58
  • @ahnkee That would depend on a lot of different factors. It doesn't sound unreasonable though. – Mike Brant Jun 17 '13 at 22:08
  • Just wanted to update you and let you know you were a tremendous help, so thank you. Using a lot of your suggestions, I was able to get back on track with my project and get it to a pretty efficient level. – ahnkee Jun 19 '13 at 16:47
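
For reference, the flag approach Mike Brant describes in the comments might look like this (the column name is_new is hypothetical):

-- Hypothetical flag column marking rows added by the latest import
ALTER TABLE master_table ADD COLUMN is_new TINYINT(1) NOT NULL DEFAULT 0;

-- New rows arrive with the flag set; duplicates keep their existing flag
INSERT INTO master_table (`areaCode`, `local`, `is_new`)
SELECT `areaCode`, `local`, 1
FROM temp_table
ON DUPLICATE KEY UPDATE `created` = NOW();

-- Export the newly added numbers, then reset the flag
SELECT `areaCode`, `local` FROM master_table WHERE is_new = 1;
UPDATE master_table SET is_new = 0 WHERE is_new = 1;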