################################################################
#
# Reddit sentiment script for comments in /r/wallstreetbets
# Loads comments from db, scans comments using word list,
# creates sentiment score and writes results to database
#
################################################################

import sqlite3
import re
from datetime import datetime


# Load key search words from a plain-text file (one word per line)
def read_words_from_file(ftext_file):
    with open(ftext_file, 'r') as f:
        load_words = tuple([x.lower().strip() for x in f])
    return load_words


# Search key words against comments
# Gives +/- for positive or negative sentiment words
def word_search(db_list, fpos_flist, fneg_flist):
    pos_count = 0
    neg_count = 0
    fdb_results_list = []
    total_lines_scanned = 0
    print("Starting word search...")

    # 1st for loop is comment data
    # 2nd for loop is key words
    for comment in db_list:
        total_lines_scanned = total_lines_scanned + 1
        for pos_item in fpos_flist:
            matches = re.findall(r"\b" + pos_item + r"\b", comment[0])
            pos_count = pos_count + len(matches)

        for neg_item in fneg_flist:
            matches = re.findall(r"\b" + neg_item + r"\b", comment[0])
            neg_count = neg_count + len(matches)

        # Determine pos/neg sentiment score based on frequency in comment
        if pos_count > neg_count:
            pos_count = pos_count / (pos_count+neg_count)
            neg_count = 0
        elif pos_count < neg_count:
            neg_count = neg_count / (pos_count+neg_count)
            pos_count = 0
        else:  # equal counts cancel out to a neutral score
            pos_count = 0
            neg_count = 0

        if pos_count > 0 or neg_count > 0:
            fdb_results_list.append([pos_count, neg_count, comment[1]])
        if total_lines_scanned % 100000 == 0:
            print("Lines counted so far:", total_lines_scanned)
        pos_count = 0
        neg_count = 0

    print("Word search complete.")
    return fdb_results_list


# Write results to a new table, dropping the old results table first.
# pos = item[0], neg = item[1], timestamp = item[2]
def write_db_results(write_db_list):
    print("Writing results to database...")
    conn = sqlite3.connect('testdb.sqlite', timeout=30)
    cur = conn.cursor()

    cur.executescript('DROP TABLE IF EXISTS redditresultstable')

    cur.executescript('''
    CREATE TABLE redditresultstable (
        id INTEGER NOT NULL PRIMARY KEY UNIQUE,
        pos_count INTEGER,
        neg_count INTEGER,
        timestamp TEXT
    );
    ''')
    for item in write_db_list:
        cur.execute('''INSERT INTO redditresultstable (pos_count, neg_count, timestamp)
                    VALUES (?, ?, ?)''', (item[0], item[1], item[2]))

    conn.commit()
    conn.close()
    print("Writing results to database complete.")


# Load comments item[2] and timestamp item[4] from db
def load_db_comments():
    print("Loading database...")
    conn = sqlite3.connect('redditusertrack.sqlite')
    cur = conn.cursor()
    cur.execute('SELECT * FROM redditcomments')
    row_db = cur.fetchall()
    conn.close()
    print("Loading complete.")
    db_list = tuple([(item[2].lower(), item[4]) for item in row_db])
    return db_list


# Main Program Starts Here
print(datetime.now())

db_list = load_db_comments()

pos_word_list = read_words_from_file("simple_positive_words.txt")
neg_word_list = read_words_from_file("simple_negative_words.txt")

db_results_list = word_search(db_list, pos_word_list, neg_word_list)
db_results_list = tuple(db_results_list)

write_db_results(db_results_list)

print(datetime.now())

This script loads 1.3 million comments into memory from SQLite, then scans 147 keywords against each comment to calculate a sentiment score: roughly 191 million iterations.

Execution takes 5 minutes and 32 seconds.

I changed most of the variables to tuples (from lists) and used list comprehensions instead of for loops (for appending). This improved execution by about 5% compared to the version that used only lists and for loops for appending. The 5% could be within the margin of error, since my method of measuring may not be accurate.
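For reference, here is the kind of isolated check the stdlib `timeit` module allows (a minimal sketch; the container contents are placeholders, not my real data):

    import timeit

    setup = "data_list = list(range(1000)); data_tuple = tuple(data_list)"
    # Time a bare iteration over each container type
    print(timeit.timeit("for x in data_list: pass", setup=setup, number=100_000))
    print(timeit.timeit("for x in data_tuple: pass", setup=setup, number=100_000))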

Stack Overflow and other resources seemed to suggest that using tuples is faster for this type of iteration, even though some posters provided evidence that in some situations lists are faster.

Is this code correctly optimized in its use of tuples and list comprehensions?

Edit: thank you all for the suggestions and comments; lots of work to do. I implemented @YuriyP's suggestion and the runtime went from 5+ minutes to 26 seconds. The issue was the per-word regex for loop in the search function.

The updated code was attached as an image (red crossed-out lines removed, green lines added); the core of the change is sketched below.
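A minimal sketch of the revised search, assuming the word lists contain no regex metacharacters (otherwise each word needs `re.escape`); the pattern is the one from my comment under the accepted answer:

    import re

    # Build each alternation pattern once, outside the comment loop
    pos_pattern = re.compile(rf'\b(?:{"|".join(pos_word_list)})\b')
    neg_pattern = re.compile(rf'\b(?:{"|".join(neg_word_list)})\b')

    for comment in db_list:
        pos_count = len(pos_pattern.findall(comment[0]))
        neg_count = len(neg_pattern.findall(comment[0]))
        # ...scoring and result collection as in word_search above...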

  • Questions seeking general review and critique of working code should be directed to [CodeReview.SE](https://codereview.stackexchange.com/). – Brian61354270 Jun 06 '21 at 03:54
  • Using tuples instead of lists and list comprehensions instead of for loops will not make your code faster – juanpa.arrivillaga Jun 06 '21 at 03:57
  • `tuple()` will consume a generator expression, you don't have to make a list first - `>>> tuple(f'{x}{i}' for i,x in enumerate('asdf;lkj'))` – wwii Jun 06 '21 at 03:57
  • Did you profile your solution to see where the bottlenecks are? - https://docs.python.org/3/library/profile.html – wwii Jun 06 '21 at 03:59
  • You *make* a regular expression pattern for each positive/negative word, and you do this for every comment. Those patterns never change; they should be made once, stored, and reused. You might even be able to make a single pattern that will find any of those words. [python-regex-check-if-string-contains-any-of-words](https://stackoverflow.com/questions/24178290/python-regex-check-if-string-contains-any-of-words) – wwii Jun 06 '21 at 04:06
  • If the comments have fewer than 147 words, it would be interesting to see if it is faster to make a pattern from the comment and search a string of the keywords. – wwii Jun 06 '21 at 04:12
  • Short of profiling, you should also `print(datetime.now())` before `write_db_results()`. Doing 1.3 million individual insert statements may be the issue (a batched-insert sketch follows these comments). That's gonna be a big log. – user212514 Jun 06 '21 at 04:15
  • @juanpa.arrivillaga Here is why I thought using tuples was faster: [why-is-tuple-faster-than-list-in-python](https://stackoverflow.com/questions/3340539/why-is-tuple-faster-than-list-in-python). wwii – I will check out those resources, thank you. user212514 – great point, I can remove the hard-drive variability from the calculation. – Justin McDonald Jun 06 '21 at 04:15
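A minimal sketch of the batched insert user212514's comment points at, using sqlite3's `executemany` (table and variable names follow the script above):

    # One executemany call replaces 1.3 million individual execute() calls;
    # all rows go in a single transaction ended by commit()
    cur.executemany(
        '''INSERT INTO redditresultstable (pos_count, neg_count, timestamp)
           VALUES (?, ?, ?)''',
        write_db_list,
    )
    conn.commit()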

2 Answers


Use a regular expression to get the total positive and negative word counts from each comment. Instead of making O(N+M) regex searches per comment (one for each positive and each negative word), you make O(1) searches per comment.

Example:


    import re

    pos_word_list = read_words_from_file("simple_positive_words.txt")
    db_list = load_db_comments()
    posWordsEx = "|".join(pos_word_list)
    pos_words = 0
    for comment in db_list:
        # each comment is a (text, timestamp) tuple, so search the text
        words = re.findall(posWordsEx, comment[0])
        pos_words += len(words)
YuriyP
  • This code is half what looks like JavaScript and not suitable for the OP's question in Python – Matthew Barlowe Jun 06 '21 at 04:17
  • My bad, forgot to check the way I edited it. – YuriyP Jun 06 '21 at 04:23
  • `",".join(pos_word_list)` - what will that pattern match? Maybe `"|".join(pos_word_list)`? with some word boundaries added in? – wwii Jun 06 '21 at 04:29
  • My bad again, forgot to use `"|"` instead of a comma – YuriyP Jun 06 '21 at 04:31
  • @YuriyP holy sh1t! Went from a 5+ minute runtime to 26 seconds. That is an insane improvement! I removed the for loops for each individual keyword search in favor of what you suggested for posWordsEx and negWordsEx. So in regex you can pass a string of words separated by `|` and search for those words almost like a for loop? – Justin McDonald Jun 06 '21 at 05:03
  • Actually, made some updates to the code: `posWordsEx = rf'\b(?:{"|".join(pos_word_list)})\b'`. This seems to be the right fit for what I was trying to search; there were some issues with the word boundaries in your original post. [match-a-whole-word-in-a-string-using-dynamic-regex](https://stackoverflow.com/questions/29996079/match-a-whole-word-in-a-string-using-dynamic-regex) – Justin McDonald Jun 06 '21 at 06:47

You should probably use cProfile to profile your code for better measurements. And yes, tuples are a tiny bit faster than lists because a list needs two blocks of memory and a tuple needs one. List comprehensions are also a tiny bit faster, but only if the expressions in your loops are very simple.

There are a couple of things you can do in your code, like using enumerate instead of a bare for loop with a manual counter, and feeding tuple() a generator expression instead of a list comprehension:

    for total_lines_scanned, comment in enumerate(db_list, start=1):
        ...

    db_list = tuple((item[2].lower(), item[4]) for item in row_db)  # no intermediate list needed

That said, trying to make such simple logic faster might not offer much advantage. Also, instead of using datetime, use something like a profile decorator that latches onto your function and gives you a report of its calls:



    import cProfile
    import io
    import pstats


    def profile(fnc):
        """A decorator that uses cProfile to profile a function."""

        def inner(*args, **kwargs):
            # Collect stats for the wrapped call and print them,
            # sorted by cumulative time
            pr = cProfile.Profile()
            pr.enable()
            retval = fnc(*args, **kwargs)
            pr.disable()
            s = io.StringIO()
            ps = pstats.Stats(pr, stream=s).sort_stats('cumulative')
            ps.print_stats()
            print(s.getvalue())
            return retval

        return inner
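
For example (hypothetical placement, reusing the OP's function): decorate the function you want to measure, and the report prints when it returns.

    @profile
    def word_search(db_list, fpos_flist, fneg_flist):
        ...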

Good luck!

Parzival