
I have a fairly long Python process that reprocesses a massive amount of data from one table, extracts the necessary pieces, assigns appropriate values, and re-enters it into another table. The process works great, except that it is incredibly time intensive. I wonder if there is a way to speed this up using an `executemany` approach? These are snippets of the entire process, but they delineate where I'm hoping to switch to `executemany`.

The following code runs through a series of if/then statements to assign the appropriate values to the retrieved information.

    # Current crawl begin date
    cur.execute('SELECT Crawl_Begin_Date FROM Crawl WHERE Crawl_ID=?', current_crawl_ID)
    current_crawl_begin_date = cur.fetchone()
    current_crawl_begin_date = current_crawl_begin_date[0]
    
    # Current crawl end date
    cur.execute('SELECT Crawl_End_Date FROM Crawl WHERE Crawl_ID=?', current_crawl_ID)
    current_crawl_end_date = cur.fetchone()
    current_crawl_end_date = current_crawl_end_date[0]
    
    # URL_Crawl table where Crawl_ID == current crawl
    sql = 'SELECT URL_Crawl_ID, Last_Updated, Last_Published, Date_of_HTML, Unique_URL_ID FROM URL_Crawl WHERE Crawl_ID=%s'
    current_crawl = pd.read_sql_query(sql %(current_crawl_ID), con=db)

    # num keeps track of number of rows read through from current_crawl (number of nodes in current crawl)
    num = 1

    # For every unique url in the current crawl
    for row in current_crawl.itertuples():
        
        # Calculate max date .......................................................
        if ((row.Last_Updated == None) | (row.Last_Updated == '')):
            last_updated = '0'
        else:
            last_updated = row.Last_Updated
        if ((row.Last_Published == None) | (row.Last_Published == '')):
            last_published = '0'
        else:
            last_published = row.Last_Published
        if ((row.Date_of_HTML == None) | (row.Date_of_HTML == '')):
            date_of_html = '0'
        else:
            date_of_html = row.Date_of_HTML
            
        if ((last_updated >= last_published) & (last_updated >= date_of_html)):
            max_date = last_updated
        elif ((last_published >= last_updated) & (last_published >= date_of_html)):
            max_date = last_published
        elif ((date_of_html >= last_updated) & (date_of_html >= last_published)):
            max_date = date_of_html
        # ..........................................................................
        
    # Set remaining variables from current_crawl dataframe
        url_crawl_id = row.URL_Crawl_ID
        unique_url_id = row.Unique_URL_ID

        # Initialize starting and end dates/statuses with None
        starting_date = None
        starting_date_status = None
        ending_date = None
        ending_date_status = None
        
        # URL_Crawl table up until (but not including) current crawl
        sql2 = 'SELECT URL_Crawl_ID, Last_Updated, Last_Published, Date_of_HTML, Unique_URL_ID FROM URL_Crawl WHERE Crawl_ID<%s'
        previous_crawls = pd.read_sql_query(sql2 %(current_crawl_ID), con=db)

        # If row's unique_url_id exists in previous crawls (not a new node)
        if (unique_url_id in (previous_crawls['Unique_URL_ID']).tolist()):
            
            # Situation B ...................................................
            
            # Finding most recent lifetime of existing node
            existing = previous_crawls[previous_crawls['Unique_URL_ID'] == unique_url_id]
            existing_url_crawl_ids = (existing.URL_Crawl_ID).tolist()
            
            existing_in_lifetime = pd.DataFrame()
            
            for i in existing_url_crawl_ids:
                sql3 = 'SELECT * FROM Lifetime WHERE URL_Crawl_ID=%d'
                exist_in_lt = pd.read_sql_query(sql3 %(i), con=db)
                existing_in_lifetime = existing_in_lifetime.append(exist_in_lt, ignore_index=True)
            
            most_recent_lifetime = existing_in_lifetime[existing_in_lifetime.Lifetime_ID == existing_in_lifetime.Lifetime_ID.max()]
                
            # Dates/statuses from most recent lifetime - convert to Strings
            most_recent_starting_date = ((most_recent_lifetime.Starting_Date).tolist())[0]
            most_recent_starting_date_status = ((most_recent_lifetime.Starting_Date_Status).tolist())[0]
            most_recent_ending_date = ((most_recent_lifetime.Ending_Date).tolist())[0]
            most_recent_ending_date_status = ((most_recent_lifetime.Ending_Date_Status).tolist())[0]
            most_recent_lifetimeID = ((most_recent_lifetime.Lifetime_ID).tolist())[0]
                
            if (max_date != '0'):
                if ((max_date >= current_crawl_begin_date) & (max_date <= current_crawl_end_date)):
                    # Situation B.2
                    ending_date = max_date
                    ending_date_status = "Exact"
                    cur.execute("""UPDATE Lifetime SET Ending_Date=?, Ending_Date_Status=? 
                                WHERE Lifetime_ID=?""", (ending_date, ending_date_status, most_recent_lifetimeID))
                    starting_date = max_date
                    starting_date_status = "Exact"
                    ending_date = None
                    ending_date_status = None
                    cur.execute("""INSERT INTO Lifetime VALUES (null, ?, ?, ?, ?, ?)
                                """, (starting_date, ending_date, starting_date_status, ending_date_status, url_crawl_id))
                elif ((max_date < current_crawl_begin_date) & (max_date > most_recent_starting_date)):
                    # Situation B.3
                    ending_date = max_date
                    ending_date_status = "Exact"
                    cur.execute("""UPDATE Lifetime SET Ending_Date=?, Ending_Date_Status=? 
                                WHERE Lifetime_ID=?""", (ending_date, ending_date_status, most_recent_lifetimeID))
                    starting_date = max_date
                    starting_date_status = "Exact"
                    ending_date = current_crawl_begin_date
                    ending_date_status = "Estimated"
                    cur.execute("""INSERT INTO Lifetime VALUES (null, ?, ?, ?, ?, ?)
                                """, (starting_date, ending_date, starting_date_status, ending_date_status, url_crawl_id))
                elif (max_date == most_recent_starting_date):
                    # Situation B.4
                    ending_date = current_crawl_begin_date
                    ending_date_status = "Estimated"
                    cur.execute("""UPDATE Lifetime SET Ending_Date=?, Ending_Date_Status=? 
                                WHERE Lifetime_ID=?""", (ending_date, ending_date_status, most_recent_lifetimeID))
                elif ((max_date > current_crawl_end_date) | (max_date < most_recent_starting_date)):
                    # Situation B.1
                    max_date = '0'
            if (max_date == '0'):
                # Situation B.5
                ending_date = current_crawl_begin_date
                ending_date_status = "Estimated"
                cur.execute("""UPDATE Lifetime SET Ending_Date=?, Ending_Date_Status=? 
                            WHERE Lifetime_ID=?""", (ending_date, ending_date_status, most_recent_lifetimeID))
                    
        # If row's unique_url_id is a new node (not seen in previous crawls)
        else:
            
            # Situation A ...................................................
            
            if (max_date != '0'):
                if ((max_date >= current_crawl_begin_date) & (max_date <= current_crawl_end_date)):
                    # Situation A.2
                    starting_date = max_date
                    starting_date_status = "Exact"
                elif (max_date < current_crawl_begin_date):
                    # Situation A.3
                    starting_date = max_date
                    starting_date_status = "Exact"
                    ending_date = current_crawl_begin_date
                    ending_date_status = "Estimated"
                elif (max_date > current_crawl_end_date):
                    # Situation A.1
                    max_date = '0'
            if (max_date == '0'):
                # Situation A.4
                starting_date = current_crawl_end_date
                starting_date_status = "Estimated"
        
            cur.execute("""INSERT INTO Lifetime VALUES (null, ?, ?, ?, ?, ?)
                        """, (starting_date, ending_date, starting_date_status, ending_date_status, url_crawl_id))
           

Can `executemany` be used in this capacity? If so, I don't know the appropriate syntax for it - I've tried a couple of things that haven't worked yet. The database is SQLite and the program is Python based.

MAb2021
  • `executemany` operates on a sequence of data. Can you point out what that sequence should be? – Klaus D. Sep 09 '21 at 04:39
  • Using executemany with `UPDATE` -- https://stackoverflow.com/questions/33793952/using-executemany-to-update-entries-in-an-existing-sqlite3-database-using-pyt – Ian Wilson Sep 09 '21 at 04:45
  • My apologies, I added more of the code that shows more about the sequence, I believe - the process runs through each url_crawl_id in the dataset. – MAb2021 Sep 09 '21 at 04:46
  • @KlausD. I just updated it again, the initial piece isn't really where the issue is - it's this part where it iterates through all of the URL_IDs and lifetime_ids. – MAb2021 Sep 09 '21 at 04:58
  • It seems that autocommit is used so that each INSERT or UPDATE is a transaction of its own. With BEGIN and a COMMIT only after e.g. each 1000 changes, things should be considerably faster (a short sketch follows these comments). – Michael Butscher Sep 09 '21 at 06:30
  • Per @IanWilson's suggestion - I was able to timestamp the segments of code and it seems to be the 3 line block starting from "# URL_Crawl table up until (but not including) current crawl" that is causing the largest slowdown. This short bit results in: 0:00:02.022475 sec per ID. – MAb2021 Sep 09 '21 at 06:40
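
A minimal sketch of that batching idea, reusing the question's `db` connection, `cur` cursor, and row loop; the INSERT and its parameters stand in for the real per-row logic, and the batch size of 1000 is just the figure mentioned in the comment:

    # Group many changes into one transaction instead of committing per statement.
    # Assumes `db`/`cur` are the sqlite3 connection/cursor from the question; the
    # INSERT and its parameters stand in for the real per-row INSERT/UPDATE logic.
    BATCH_SIZE = 1000
    pending = 0

    for row in current_crawl.itertuples():
        # ... derive starting_date, ending_date, etc. as in the question ...
        cur.execute("INSERT INTO Lifetime VALUES (null, ?, ?, ?, ?, ?)",
                    (starting_date, ending_date, starting_date_status,
                     ending_date_status, row.URL_Crawl_ID))
        pending += 1
        if pending >= BATCH_SIZE:
            db.commit()    # one commit per ~1000 changes instead of one per row
            pending = 0

    db.commit()            # commit the final partial batch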

1 Answer


Kind of hard to give an exact answer without fully understanding your code - I don't quite see where you are iterating over URLs/IDs/etc. You will want to make a list for updates and a list for inserts outside your loop, then accumulate the sequences of params in their corresponding list. Finally, after the loop, pass each list to `executemany` with the fixed SQL you want to execute.

This should give you an idea of how it would work with a loop/iteration.


    #...

    # These are each a list of tuples/lists
    # ie. [(param0, ..., paramN), ..., (param0, ..., paramN)]
    params_to_update = []
    params_to_insert = []

    # For every unique url in the current crawl
    for row in current_crawl.itertuples():

        #...

            if (max_date != '0'):
                if ((max_date >= current_crawl_begin_date) & (max_date <= current_crawl_end_date)):
                    # Situation B.2
                    ending_date = max_date
                    ending_date_status = "Exact"
                    params_to_update.append((ending_date, ending_date_status, most_recent_lifetimeID))
                    starting_date = max_date
                    starting_date_status = "Exact"
                    ending_date = None
                    ending_date_status = None
                    params_to_insert.append((starting_date, ending_date, starting_date_status, ending_date_status, url_crawl_id))
                elif ((max_date < current_crawl_begin_date) & (max_date > most_recent_starting_date)):
                    # Situation B.3
                    ending_date = max_date
                    ending_date_status = "Exact"
                    params_to_update.append((ending_date, ending_date_status, most_recent_lifetimeID))
                    starting_date = max_date
                    starting_date_status = "Exact"
                    ending_date = current_crawl_begin_date
                    ending_date_status = "Estimated"
                    params_to_insert.append((starting_date, ending_date, starting_date_status, ending_date_status, url_crawl_id))

    # After for loop is done.
    # Call UPDATE for each sequence of params in this list.
    UPDATE_SQL = """UPDATE Lifetime SET Ending_Date=?, Ending_Date_Status=? WHERE Lifetime_ID=?"""
    cur.executemany(UPDATE_SQL, params_to_update)
    # Call INSERT for each sequence of params in this list.
    INSERT_SQL = """INSERT INTO Lifetime VALUES (null, ?, ?, ?, ?, ?)"""
    cur.executemany(INSERT_SQL, params_to_insert)
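
One follow-up that the snippet above doesn't show: with the sqlite3 module's default settings, `executemany` still runs inside a transaction, so the accumulated changes need a single commit afterwards. A minimal sketch, assuming `db` is the question's sqlite3 connection:

    # Commit once after both executemany calls so all accumulated changes land
    # in a single transaction (assumes `db` is the sqlite3 connection).
    db.commit()

    # Equivalent alternative: using the connection as a context manager commits
    # on success and rolls back if an exception is raised.
    with db:
        cur.executemany(UPDATE_SQL, params_to_update)
        cur.executemany(INSERT_SQL, params_to_insert)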

stackoverflow.com: using-executemany-to-update-entries-in-an-existing-sqlite3-database-using-pyt

docs.python.org: python doc executemany example

Ian Wilson
  • Thank you for your response! I think I'm understanding it better now. I didn't want to post the whole code because it's cumbersome. But, I modified the initial post to include the whole thing. With this, then, if the code runs over multiple scenarios, should each scenario then be its own update/insert statement after the iteration? – MAb2021 Sep 09 '21 at 05:22
  • @MeredithAbrams It seems like you would accumulate the sequences of params inside this loop -- `for row in current_crawl.itertuples():` . After that loop each unique statement would need to be executed separately. In your example it seems that the two inserts and two updates were the same (in the `if max_date ... elif` blocks) so I grouped them. You could do that for more blocks if the statements are the same. – Ian Wilson Sep 09 '21 at 05:36
  • It seems that in all your blocks your `INSERT`s are the same and your `UPDATE`s are the same. So you could just keep appending to the corresponding list and then pass it to executemany at the end. – Ian Wilson Sep 09 '21 at 05:39
  • I see, that's really helpful! Thank you! And this should, in theory, expedite the process, correct? It's currently averaging only about 30 records per min. – MAb2021 Sep 09 '21 at 05:43
  • In theory, but 30 records per minute is very slow - it may be that the number of rows in the table is large and the updates/inserts are causing an index to be updated on all the rows. At the very least you should manually profile, e.g. `start = datetime.now()` then later `print(datetime.now() - start)`, to see how much time elapses at each step. – Ian Wilson Sep 09 '21 at 05:48
  • I.e. if a single insert takes 10 seconds because you have 10,000,000 rows in that table, then executemany probably isn't going to make a big difference. Maybe it would do 31 records per minute. I'm completely making up numbers, but you get the idea. The slowdown might not be what you think it is. – Ian Wilson Sep 09 '21 at 05:52
  • Oh no! The tables are extremely large in terms of number of rows. I spent a lot of time on the DB schema to improve efficiency there. I'll have to see where the slowdown is occurring! – MAb2021 Sep 09 '21 at 05:56
  • Thank you for your time on this! I was able to move the segment out of the loop and it runs far faster now. – MAb2021 Sep 09 '21 at 07:09
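
For reference, a minimal sketch of the change described in that last comment - reading the previous crawls once, before the row loop, instead of re-querying per row. It assumes the same `db` connection and column names as in the question:

    # Read URL_Crawl rows from earlier crawls a single time, outside the loop.
    sql2 = ('SELECT URL_Crawl_ID, Last_Updated, Last_Published, Date_of_HTML, Unique_URL_ID '
            'FROM URL_Crawl WHERE Crawl_ID<?')
    previous_crawls = pd.read_sql_query(sql2, con=db, params=(current_crawl_ID,))
    previous_url_ids = set(previous_crawls['Unique_URL_ID'])

    for row in current_crawl.itertuples():
        # ...
        if row.Unique_URL_ID in previous_url_ids:      # O(1) set lookup per row
            existing = previous_crawls[previous_crawls['Unique_URL_ID'] == row.Unique_URL_ID]
            # ... rest of Situation B exactly as in the question ...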