1

I have a table of a ton of data--it's just a single column of a million phone numbers.

I'm working on a PHP script that will receive an uploaded file of numbers, and I'd like to run each number against the database to see if the number already exists. If it does exist, I just need a true returned and no other data.

I have 2 main questions:

  1. For performance/speed consideration, would it make a difference if
    • The column was broken down into two (Column A with Area Code, Column B with remaining digits). That way I could run a query using WHERE to find a matching Area Code for a given number, then match the remaining 7 digits?
    • Or would that make no difference than matching the entire 10 digits?
  2. The answer to this question, Best way to test if a row exists in a MySQL table, was:

    SELECT EXISTS(SELECT 1 FROM table1 WHERE ...)
    

    If I'm not mistaken, that will simply return the value "1" if what you're searching for exists, correct? Since I'll be receiving the results in PHP, would this be the most efficient way to go?

Community
  • 1
  • 1
ahnkee
  • 125
  • 10
  • How big will the uploaded list of numbers be ? – SteveP Mar 22 '13 at 09:11
  • 1
    If your goal is to add the record if it doesn't exit, then IF there is a unique key on the telephone number as Andrew Cranston mentions, then any insert matching existing data will fail, but any new ones will be successful. – Waygood Mar 22 '13 at 09:27
  • It will probably reach somewhere around 8 million total numbers @SteveP – ahnkee Mar 22 '13 at 09:31
  • How many numbers will be in the uploaded file you are checking ? – SteveP Mar 22 '13 at 09:32
  • I'll be checking anywhere from 5k - 10k numbers at a time @SteveP – ahnkee Mar 22 '13 at 09:40

5 Answers5

3

If the numbers are small (and will always be small), then just using an IN would be fine.

If the numbers are potentially large then I would load them into a temp table and do a JOIN between that temp table and you existing table. Using an inner join only matching records would be returned, and easy to just return the number and true.

If you are dealing with thousands of numbers then:-

Create a temporary table (probably using the number as a unique key). Load the temporary table with the numbers. Join the temp table against you existing table:-

SELECT a.phoneNumber
FROM ExistingNumbers a
INNER JOIN SomeTempTable b
ON a.phoneNumber = b.phoneNumber

That would get you a list of all the numbers that already exist on the table.

If you would prefer a list of all the number that do not already exist on the table (possibly if you would want to add them after some further processing):-

SELECT b.phoneNumber
FROM SomeTempTable b
LEFT OUTER JOIN ExistingNumbers a
ON a.phoneNumber = b.phoneNumber
WHERE a.phoneNumber IS NULL 
Kickstart
  • 21,403
  • 2
  • 21
  • 33
  • The numbers will always be 10 digits long, since we are working with phone numbers. In terms of SQL, I'm going to assume that it's considered a 'small' number, correct? Or were you thinking not the length of the numbers, but the quantity of numbers being checked? – ahnkee Mar 22 '13 at 09:39
  • 1
    I meant quantity rather than the length of individual numbers. If you are checking a few dozen numbers then use IN. If using hundreds then probably still use IN, but consider if the quantity will ramp up. If thousands then don't use IN. You could use similar functionality on your existing table to check for duplicates. – Kickstart Mar 22 '13 at 09:51
  • I expect to be checking thousands at a time, upwards to 10k. What would you suggest is the best way to go? – ahnkee Mar 22 '13 at 10:05
  • 1
    Create a temporary table, load it with the numbers you want to check and then do a select. I will modify my original post. – Kickstart Mar 22 '13 at 10:09
  • Ah, what you suggested earlier. Thank you for your input. It's been a good learning experience tonight and I will try out all suggestions tomorrow and see what works best for me. While I understand it in theory, I've never done temp tables before so hopefully I don't trip up on it too badly tomorrow. – ahnkee Mar 22 '13 at 10:14
  • Temp tables are easy enough. Don't worry about them. The only minor difficulty is in testing them in phpmyadmin (they are temporary, so phpmyadmin creates them but mysql deletes them once the session ends and the screen is displayed) – Kickstart Mar 22 '13 at 10:16
2

For performance reasons I would select all numbers in one query.

SELECT `numbers` FROM `phone_numbers` WHERE `number` IN('123', '456', '789')

Then you can iterate over the result and check if a number is contained in it or not. If a number isn't contained then append it to your insert query. So you have only two queries. One for select and one for insert.

To your performance question. You should have an index on your field which stores the phone numbers.

akkie
  • 2,523
  • 1
  • 20
  • 34
  • 1
    Index is a must, but as phone numbers are unique it won't be a terribly efficient index. An enhancement you can make is to have a column that stores only the first section of the number. Eg area code. Then shard on this section of the index. – Andrew Cranston Mar 22 '13 at 09:22
  • Using the IN approach is fine, unless there are lots of numbers to check. – SteveP Mar 22 '13 at 09:25
  • My understanding of SQL isn't awesome. I read up on a lot of resources today and was wondering if having a column of area codes and indexing that would help, so thank you @AndrewCranston for that tip. – ahnkee Mar 22 '13 at 09:34
  • We are expecting our database to reach around 8 million numbers. Adding in new entries is important, but we are currently more concerned about checking for existing duplicates (not sure if this info is relevant to anything). But given that expectation, is using IN still a good way to go? – ahnkee Mar 22 '13 at 09:37
  • 1
    @SteveP Please see http://www.pythian.com/blog/debugging-in-vs-or-performance-in-mysql or http://stackoverflow.com/a/2481458/2153190. – akkie Mar 22 '13 at 09:41
  • @akkie thanks for those references. I would not have considered putting thousands of items in one query. – SteveP Mar 22 '13 at 09:46
  • Great links. Thank you @akkie, I think I have a solid plan to follow now. – ahnkee Mar 22 '13 at 09:48
  • @AndrewCranston Please can you explain why this index won't be a terribly efficient index? Good the index would be really huge. But a unique index has a selectivity of 1 and a better selectivity isn't possible. – akkie Mar 22 '13 at 10:05
  • @akkie Indexes with max cardinality slow down writes, but as you've said, you don't mind about writes.. just uniqueness. Sounds like a hash index would be the best fit - O(1) (+Overhead to hash the string). – Andrew Cranston Mar 22 '13 at 12:10
1

Just wanted to add one more thing, the index won't work if the IN clause contains large entries. After a few no of entries the whole table will be scanned in a normal way, ignoring the index on 'numbers'.

For the above mentioned reason, I would prefer @Kickstart solution of using temp tables and inner join.

VikasG
  • 579
  • 6
  • 14
0

I'd do it like

SELECT COUNT(1) FROM table WHERE condition LIMIT 1
Trolley
  • 2,328
  • 2
  • 23
  • 28
0

This is usually how I check if a searched data is stored in the table

<?php
function checkifexist($value){
$link = mysql_connect("localhost", "root", "password");
mysql_select_db("database", $link);    

$result = mysql_query("SELECT column FROM table where column='".$value."' ", $link);
$num_rows = mysql_num_rows($result);

if($num_rows > 0)
{
return true;
}
else
{
return false;
}
}
?>

and then this

<?php 
$input = 'test';
if(checkifexist($input))
{
// the script returned true, now do something
}
?>

all it does is count the number of results and made a conditional statement

Hobo
  • 33
  • 3
  • That's where I was initially headed, and started to wonder if I could benefit from splitting my column into two (area code, rest of digits) and indexing the area code column. Thanks for sharing your code though, glad to know I was on the right path! – ahnkee Mar 22 '13 at 09:43