0

I loop through a number of messages with an unixtime as well as an userid, where I want to find the number of messages inside a 24 hr time slot for each user. I posted my code on codereview to get some help. From there I optimized the

cur.execute('SELECT unixtime FROM MessageType1 WHERE userID ='+str(userID[index])+' ORDER BY unixtime asc')

query as I found that it compromized of 6.7s of the total 7.2s running time of my code. I optimized it by making indexes on the unixtime column using CREATE INDEX unixtime_times ON MessageType1 (unixtime asc). Now that query takes 0.00117s opposed to 6.7s, but the overall time it takes to run my code has gone from 7.2s to 15.8s. Everything is unchanged except for the indexes. Upon further inspection, it seems like messages = cur.fetchall() is taking 15.3s after implementing indexes. Anyone have clue why? Thanks in advance!

con = lite.connect(databasepath)
    userID = []
    messages = []
    messageFrequency = []
    with con:
        cur = con.cursor()
        #Get all UserID
        cur.execute('SELECT DISTINCT userid FROM MessageType1')
        userID = cur.fetchall()
        userID = {index:x[0] for index,x in enumerate(userID)}
        #For each UserID
        for index in range(len(userID)):
            messageFrequency.append(0)
            #Get all MSG with UserID = UserID sorted by UNIXTIME
            cur.execute('SELECT unixtime FROM MessageType1 WHERE userID ='+str(userID[index])+' ORDER BY unixtime asc')
            messages = cur.fetchall()
            messages = {index:x[0] for index,x in enumerate(messages)}
            #Loop through every MSG
            for messageIndex in range(len(messages)):
                frequency = 0
                message = messages[messageIndex]
                for nextMessageIndex in range(messageIndex+1, len(messages)):
                #Loop through every message that is within 24 hours
                    nextmessage = messages[nextMessageIndex]
                    if  nextmessage < message+(24*60*60):
                    #Count the number of occurences
                        frequency += 1
                    else:
                        break
                #Add best benchmark for every message to a list that should be plotted.
                if messageFrequency[-1]<frequency:
                    messageFrequency[-1] = frequency
Community
  • 1
  • 1
bjornasm
  • 2,211
  • 7
  • 37
  • 62
  • 1
    you should create an index on column userId (the condition of the WHERE clause) not for the purpose of sorting. I doubt that creating an index on unixtime will do any good in your case. – miraculixx Mar 08 '15 at 16:42
  • 1
    You are reselecting each and every user by the user_id... even though 1 of these queries is fast, your forloop will take forever. – Antti Haapala -- Слава Україні Mar 08 '15 at 16:45
  • Thank you for your comment @Miraculixx. It seems like indexing the unixtime column shaves off 6.7 seconds in the select query, but adds 15 s to the fetchall query. So it kind of does any good, but that gets eliminated when I fetchs all values. Could a joint index on unixtime and userid help? I shall try to create an index on userId as well. – bjornasm Mar 08 '15 at 16:47
  • 1
    Avoid cursors!! RDMS are optimised for set based operations. Making use of cursors defeats that idea as it needs to fetch records from physical storage one by one as opposed to whole set, which in turn slows down it drastically – Ubaid Ashraf Mar 08 '15 at 16:47
  • @Miraculixx - Adding an index on userid led the total time to 1s, thank you! – bjornasm Mar 08 '15 at 16:51
  • @antti-haapala Thank you for your answer. What do you mean by reselecting? – bjornasm Mar 08 '15 at 16:51
  • @ubaid-ashraf-masoody Thank you for your answer. Do I understand correctly that you mean that I expose the result? I must note that this code will never run live, but only on my laptop. – bjornasm Mar 08 '15 at 16:51
  • 1
    @bjornasm You are making use of cursors, which in turn is very slow. If you somehow remove cursors, it will work faster – Ubaid Ashraf Mar 08 '15 at 17:10
  • @ubaidashrafmasoody Thank you! Do you have any hint on how I shall avoid cursors in this code? – bjornasm Mar 08 '15 at 17:30
  • Chek this http://stackoverflow.com/questions/58141/why-is-it-considered-bad-practice-to-use-cursors-in-sql-server @bjornasm – Ubaid Ashraf Mar 08 '15 at 17:39

1 Answers1

2

The best index for this query:

SELECT unixtime
FROM MessageType1
WHERE userID ='+str(userID[index])+'
ORDER BY unixtime asc

is MessageType1(UserId, unixtime).

With an index on unixtime only, the database has basically two possible execution plans:

  1. It can ignore the index, read the rows "sequentially", filter them, and then do the sort.
  2. It can return the rows from the index, in sorted order, and then filter on output.

My guess is that it is choosing the second approach, based on your timings. The "fetch" component of the processing ends the execution of the query, so that is really fast. It then has to read the entire table to get the results you want.

This approach can take longer than just reading the data in order, because of locality issues. Without the index, it would read the first page and all the records on the first page. With the index, each record is on a random page -- no locality. This can be particularly problematic when the table is larger than memory available for the page cache, and you end up with a situation known as "thrashing".

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786