1

I translated the problem into an employee/salary problem for simplicity.

Having an employee record emp such as:

| id | salary (in 1000s) |

Given a number 'num', find salary 'sal' where the number of employees receiving salary<=sal are >=num (similar to area under curve problems in statistics). We are using Python and SQLite, but the problem is not specific to them:

I'm doing the following (naive starting solution):

num = some_num
sal = 1000 # starting miminmum value
count = 0
while count < num:
    sql = 'select count(*) from (select 1 from emp where salary<=? limit ?)' 
    # using limit so that we don't keep counting more than num - might help (?)
    (count,) = cursor.execute(sql, (sal, num)).next() # using apsw sqlite adapter
    sal += 1000

    print sal

How can we make this more efficient? (Algorithmically or using standard SQL or equivalent, but not using quirks of a given system)

Or else: can it be made more efficient by adding extra fields to the record that can be kept up-to-date on insert/update ops without too much overhead?

VLAZ
  • 26,331
  • 9
  • 49
  • 67
Basel Shishani
  • 7,735
  • 6
  • 50
  • 67

1 Answers1

1

If you are using a prepared statement, I believe you can move the preparation step out of the loop to make it much faster.

sql = 'select count(*) from (select 1 from emp where salary<=? limit ?)' 
# using limit so that we don't keep counting more than num - might help (?)
while count < num:
    (count,) = cursor.execute(sql, (sal, num))
    sal += 1000

If you further want to improve performance and your db size is reasonably small, you could load the entire data into an array and do your operation.

I think further optimization is possible if you sort the array by salary first. After that you can do things like binary Search to where the < condition flips and the index of that point + 1 would be the count.

The solution is simpler than it looks. If the records are sorted by salary then the #num'th record's salary would be the desired answer, so this becomes a problem of selecting the n'th row:

num = some_num
sql = 'select salary from emp order by salary limit 1 offset ?'
(sal,) = cursor.execute(sql, (num-1,)).next()
print sal
VLAZ
  • 26,331
  • 9
  • 49
  • 67
Karthik T
  • 31,456
  • 5
  • 68
  • 87
  • Thanks. That would help a bit I guess, but we're still doing many count ops, and also if there are big gaps in salary then many of these ops are needless. – Basel Shishani Dec 14 '12 at 05:58
  • @BaselShishani could you please read my edit? looks like the solution is simpler than it looks.. – Karthik T Dec 14 '12 at 06:20
  • Yes if we sort by salary and select the num'th record, the total records below that salary value (inclusive) would be equal to num. – Basel Shishani Dec 14 '12 at 07:03
  • @BaselShishani in that case a one liner SQL statement might suffice but I am not good enough to do it, for one thing there seems [no standards compliant way](http://stackoverflow.com/questions/16568/how-to-select-the-nth-row-in-a-sql-database-table) to select a row. – Karthik T Dec 14 '12 at 07:05