
How can I improve the following function from a speed point of view? Ideally, I'd like this to use executemany as part of the update process.

While the function works fine, I'm sure there is a more efficient way to do this: check if a value exists, then update or insert as applicable.

I need to do this daily across millions of rows of data.

def insert_or_update(self, profile_id, landing_page, keyword, position, impressions, clicks, ctr):
    ''' Checks if an entry exists; if not, it is inserted. If it does, the
    metrics are updated.
    '''
    try:
        self.cursor.execute('select id, position, impressions, clicks, ctr from temp where profile_id={} and keyword="{}" and landing_page="{}"'.format(profile_id, keyword, landing_page))
        data = self.cursor.fetchone()
        if data:
            row_id = data[0]
            sql_impressions = data[2] + impressions
            sql_clicks = data[3] + clicks
            sql_ctr = sum([data[4], ctr]) / len([data[4], ctr])
            # if the keyword/landing_page exists
            self.cursor.execute("update temp set position={}, impressions={}, clicks={}, ctr={} where id={}".format(position, sql_impressions, sql_clicks, sql_ctr, row_id))
            # Commit your changes in the database
            self.db.commit()
            return self.cursor.lastrowid
        else:
            # if the keyword/landing_page doesn't exist
            self.cursor.execute("insert into temp (profile_id, landing_page, keyword, position, impressions, clicks, ctr) values (%s, %s, %s, %s, %s, %s, %s)", (profile_id, landing_page, keyword, position, impressions, clicks, ctr))
            # Commit your changes in the database
            self.db.commit()
            return self.cursor.lastrowid
    except Exception as e:
        # Rollback in case there is any error
        self.db.rollback()
        return e
    finally:
        self.db.close()

1 Answer


There are a bunch of performance problems here if you need to do this millions of times.

  • You're preparing the same SQL statement over and over again, millions of times. It would work better to prepare it once and execute it millions of times.

  • You're disconnecting from the database on every function call after a single query. That means you need to reconnect each time and any cached information is thrown away. Don't do that, leave it connected.

  • You're committing after each row. This will slow things down. Instead, commit after doing a batch.

  • The select + update or insert can probably be done as a single upsert.

  • Inserting so much into a temp table is probably a performance issue in itself.

  • If the table has many indexes, that can slow inserts. Sometimes it's best to drop the indexes, do a big batch update, and recreate them (see the sketch after this list).

  • Because you're putting values directly into your SQL, your SQL is open to a SQL injection attack.
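
For the index point above, here's a rough sketch of the drop/load/recreate pattern. The index name and columns are made up for illustration; only do this if nothing else needs the index while the batch runs:

# hypothetical secondary index; adjust the name and columns to your schema
self.cursor.execute("alter table temp drop index idx_keyword_page")

# ... run the bulk insert/update batch here and commit it ...

self.cursor.execute("alter table temp add index idx_keyword_page (keyword, landing_page)")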


Instead...

  • Use prepared statements and bind parameters
  • Leave the database connected
  • Do updates in bulk
  • Only commit at the end of a run of updates
  • Do all the math in the UPDATE rather than SELECT + math + UPDATE.
  • Use an "UPSERT" instead of SELECT then UPDATE or INSERT

First off, prepared statements. These let MySQL compile the statement once and then reuse it. The idea is you write a statement with placeholders for the values.

select id, position, impressions, clicks, ctr
from temp
where profile_id=%s and
      keyword=%s and 
      landing_page=%s

Then you execute that with the values as arguments, not as part of the string.

self.cursor.execute(
   'select id, position, impressions, clicks, ctr from temp where profile_id=%s and keyword=%s and landing_page=%s',
   (profile_id, keyword, landing_page)
)

This allows the database to cache the prepared statement and not have to recompile it each time. It also avoids a SQL injection attack, where a clever attacker can craft a value that is actually more SQL, like " MORE SQL HERE ". It is a very, very, very common security hole.

Note, you might need to use MySQL's own Python database library to get true prepared statements. Don't worry about it too much, using prepared statements is not your biggest performance problem.
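
For example, with the mysql-connector-python driver you can ask for a server-side prepared statement by creating the cursor with prepared=True. This is only a sketch; the connection details are placeholders, and other drivers such as MySQLdb/PyMySQL bind the values client-side instead, which still protects you from injection:

import mysql.connector

db = mysql.connector.connect(host="localhost", user="me", password="...", database="mydb")
cursor = db.cursor(prepared=True)  # statement is compiled once, then reused

select_sql = (
    "select id, position, impressions, clicks, ctr "
    "from temp "
    "where profile_id=%s and keyword=%s and landing_page=%s"
)

cursor.execute(select_sql, (profile_id, keyword, landing_page))
row = cursor.fetchone()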


Next, what you're basically doing is adding to an existing row, or if there is no existing row, inserting a new one. This can be done more efficiently in a single statement with an UPSERT, a combined INSERT and UPDATE. MySQL has it as INSERT ... ON DUPLICATE KEY UPDATE.

To see how this is done, we can write your SELECT then UPDATE as a single UPDATE. The calculations are done in the SQL.

    update temp
    set impressions = impressions + %s,
        clicks = clicks + %s,
        ctr = (ctr + %s) / 2
    where profile_id=%s and
          keyword=%s and
          landing_page=%s

Your INSERT remains the same...

    insert into temp
        (profile_id, landing_page, keyword, position, impressions, clicks, ctr)
        values (%s, %s, %s, %s, %s, %s, %s)

Combine them into one INSERT ON DUPLICATE KEY UPDATE.

    insert into temp
        (profile_id, landing_page, keyword, position, impressions, clicks, ctr)
        values (%s, %s, %s, %s, %s, %s, %s)
    on duplicate key update
        impressions = impressions + %s,
        clicks = clicks + %s,
        ctr = (ctr + %s) / 2

This depends on what the keys of the table are defined as. If you have unique( profile_id, landing_page, keyword ) then it should work the same as your code.
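
Assuming you add that unique key (a sketch; the key name is made up, and if keyword or landing_page are long varchar/text columns you may need prefix lengths or a hash column instead), the upsert is then executed with the seven insert values followed by the three values used in the update arithmetic:

# one-time schema change: a unique key so "duplicate key" fires for the same
# profile/keyword/landing page combination (hypothetical key name)
self.cursor.execute(
    "alter table temp add unique key uniq_profile_kw_lp (profile_id, keyword, landing_page)"
)

upsert_sql = (
    "insert into temp "
    "(profile_id, landing_page, keyword, position, impressions, clicks, ctr) "
    "values (%s, %s, %s, %s, %s, %s, %s) "
    "on duplicate key update "
    "impressions = impressions + %s, "
    "clicks = clicks + %s, "
    "ctr = (ctr + %s) / 2"
)

# seven values for the insert, then three more for the update arithmetic
self.cursor.execute(
    upsert_sql,
    (profile_id, landing_page, keyword, position, impressions, clicks, ctr,
     impressions, clicks, ctr)
)
self.db.commit()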

Even if you can't do the upsert, you can eliminate the SELECT by trying the UPDATE, checking whether it updated anything, and doing an INSERT only if it didn't.
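
A minimal sketch of that fallback, using cursor.rowcount to see whether the UPDATE matched anything (note that MySQL reports changed rows by default; enable the FOUND_ROWS client flag on the connection if you need matched rows here):

update_sql = (
    "update temp "
    "set position=%s, "
    "    impressions = impressions + %s, "
    "    clicks = clicks + %s, "
    "    ctr = (ctr + %s) / 2 "
    "where profile_id=%s and keyword=%s and landing_page=%s"
)
self.cursor.execute(update_sql, (position, impressions, clicks, ctr,
                                 profile_id, keyword, landing_page))

if self.cursor.rowcount == 0:
    # no existing row was touched, so insert a fresh one
    self.cursor.execute(
        "insert into temp "
        "(profile_id, landing_page, keyword, position, impressions, clicks, ctr) "
        "values (%s, %s, %s, %s, %s, %s, %s)",
        (profile_id, landing_page, keyword, position, impressions, clicks, ctr)
    )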


Do the updates in bulk. Instead of calling a subroutine which does one update and commits, pass it a big list of things to be updated and work on them in a loop. You can even take advantage of executemany to run the same statement with multiple values. Then commit.
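
This is where the executemany you asked about fits: build a list of parameter tuples for the whole batch, run the statement once over all of them, and commit at the end. A sketch reusing the upsert_sql and the seven-plus-three parameter layout from above; depending on the driver, executemany may or may not collapse this into one multi-row statement, but you still save the per-row commits:

# rows is a list of (profile_id, landing_page, keyword, position,
# impressions, clicks, ctr) tuples for the whole batch
params = [
    (p_id, page, kw, pos, imp, clk, c, imp, clk, c)
    for (p_id, page, kw, pos, imp, clk, c) in rows
]

self.cursor.executemany(upsert_sql, params)

# one commit for the whole batch instead of one per row
self.db.commit()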

You might be able to do the UPSERT in bulk. INSERT can take multiple rows at once. For example, this inserts three rows.

insert into whatever
    (foo, bar, baz)
values (1, 2, 3),
       (4, 5, 6), 
       (7, 8, 9)

You can likely do the same with your INSERT ON DUPLICATE KEY UPDATE reducing the amount of overhead to talk to the database. See this post for an example (in PHP, but you should be able to adapt).
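
Here's a sketch of building such a multi-row upsert by hand, using MySQL's VALUES() in the update clause so each row only needs its seven placeholders (the values are still bound, never pasted into the string; newer MySQL versions prefer the row-alias form over VALUES()):

row_placeholders = "(%s, %s, %s, %s, %s, %s, %s)"
sql = (
    "insert into temp "
    "(profile_id, landing_page, keyword, position, impressions, clicks, ctr) "
    "values " + ", ".join([row_placeholders] * len(rows)) + " "
    "on duplicate key update "
    "position = values(position), "
    "impressions = impressions + values(impressions), "
    "clicks = clicks + values(clicks), "
    "ctr = (ctr + values(ctr)) / 2"
)

# flatten the list of row tuples into one flat parameter sequence
flat_params = [value for row in rows for value in row]
self.cursor.execute(sql, flat_params)
self.db.commit()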

This sacrifices returning the ID of the last inserted row, but them's the breaks.

    Thanks mate, that's a lot for me to go off and keep me busy!! :) – Adders Dec 12 '16 at 16:32
  • on insert there is 7 values but only 3 for the update, how would you pass both sets in? – Adders Dec 12 '16 at 21:26
  • I've sorted it - Just want to say thanks again. It's running like the clappers now! – Adders Dec 12 '16 at 22:01
  • @Adders You probably figured out that the duplicate key (which can be multi column) acts like a where clause for the update. That's why you'd probably need to define a multi part unique key. Glad you got it sorted! – Schwern Dec 12 '16 at 22:24