
I am trying to find out whether each insert statement in labels.sql is still needed. For example, given

Insert into EPORTAL_DATA.MLNGTXT (MODULEID,TEXTID,LANGUAGEID,MLTEXT,TEXTTYPE) values ('COMMON','SearchButton','en-ca','Search','LABEL');

if the TEXTID in an insert statement (here 'SearchButton') is referenced in any of the other files, I keep the statement; otherwise I delete it.

import os
import re
import shutil

current_dir_location = os.getcwd()
labels_script_location = os.path.join(current_dir_location, 'SQL-scripts/labels.sql')
copy_location = os.path.join(current_dir_location, 'SQL-scripts/lablesCopy.sql')

# Create a copy of the labels.sql file
shutil.copy2(labels_script_location, copy_location)

#Read insert statements from copy file
with open(copy_location, 'r',encoding='UTF-8') as file:
    insert_statements_from_labels = file.readlines()

# Compile the regular expression pattern
text_id_pattern = re.compile(r"'[^']+'\s*,\s*'([^']+)'")  # Regular expression for finding the values

def search_references(references_dir, text_id):
    regex = re.compile(text_id)
    for root, dirs, files in os.walk(references_dir):
        if 'node_modules' in dirs:
            dirs.remove('node_modules')
        if 'SQL-scripts' in dirs:
            dirs.remove('SQL-scripts')

        for file_name in files:
            file_path = os.path.join(root, file_name)
            try:
                with open(file_path, 'r', encoding='utf-8') as file:
                    content = file.read()
                    if regex.search(content):
                       return True # stop searching as soon as a match is found
            except (UnicodeDecodeError,PermissionError): 
                # some of the files are not encoded as UTF-8
                # some files can not be read like read.lock
                continue

    return False

def get_text_id(insert_statement):    
    match = text_id_pattern.search(insert_statement)
    if match:
        text_id = match.group(1)  # Group 1 captures the second quoted value, i.e. the TEXTID
    else:
        text_id = None
    return text_id


def search_decide():
    lines_deleted = 0
    used_insert_statements = []
    unused_insert_statement = {}
    for index, insert_statement in enumerate(insert_statements_from_labels):
        text_id = get_text_id(insert_statement)
        if text_id:
            is_used = search_references(current_dir_location, text_id)
            if is_used:
                used_insert_statements.append(insert_statement)
            else:
                unused_insert_statement[index] = insert_statement
                lines_deleted += 1

    with open(copy_location, 'w', encoding='utf-8') as file:
        file.writelines(used_insert_statements)

    for index, statement in unused_insert_statement.items():
        print(f"Unused Insert Statements are : \n{index}: {statement}\n")
    print("Script Ran Successfully")
    print(f"{lines_deleted} lines were deleted from labels.sql")

if __name__ == "__main__":
    search_decide()

Is there any possibility of optimization or performance improvement in this code? I am reading a 2000-line labels.sql file and checking for references, and right now the script takes around 10 minutes to finish.

These are some of the details I found using the profiler and snakeviz: [snakeviz output screenshot]

and the table: [profiler table screenshot]
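For reference, a profile like this can be collected roughly as follows (the script filename clean_labels.py is just an illustration):

python -m cProfile -o labels.prof clean_labels.py
snakeviz labels.prof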

  • Have you profiled the code? You could use snakeviz to get an idea of where to look for performance improvement. If you could share this with us, that would help us with your problem. – Dumbo Jul 06 '23 at 13:52
  • @Dumbo please check the details above – HappyQuest Jul 06 '23 at 15:51

1 Answer


Seems like your problem is that you're searching the same, presumably large, directory structure over and over, once for each search term. If you've got 2000 things to search for, based on your ncalls for open, you probably have 300-odd files under that tree, and you're reopening and rereading each of them 2000 times.

It would almost certainly be faster to only open and read each file once, and search it for all the search terms in sequence (or better, in a single pass) before moving to the next file. Or just search a smaller number of files in the first place if that's possible.

File I/O is expensive, try to do less of it.
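
A minimal sketch of that single-pass idea, reusing the question's get_text_id and directory-skipping logic, and assuming the text IDs are plain strings (the original re.compile(text_id) is effectively a substring search unless an ID contains regex metacharacters):

import os

def find_used_text_ids(references_dir, text_ids):
    """Walk the tree once and return the subset of text_ids seen in any file."""
    remaining = set(text_ids)   # IDs not found yet
    used = set()
    for root, dirs, files in os.walk(references_dir):
        # Same directory skipping as the original script
        dirs[:] = [d for d in dirs if d not in ('node_modules', 'SQL-scripts')]
        for file_name in files:
            if not remaining:   # everything already found, stop early
                return used
            try:
                with open(os.path.join(root, file_name), 'r', encoding='utf-8') as f:
                    content = f.read()
            except (UnicodeDecodeError, PermissionError):
                continue
            # Check every still-missing ID against this file's content
            found_here = {tid for tid in remaining if tid in content}
            used |= found_here
            remaining -= found_here
    return used

# Usage with the names from the question's script:
# all_ids = {get_text_id(s) for s in insert_statements_from_labels} - {None}
# used_ids = find_used_text_ids(current_dir_location, all_ids)
# kept = [s for s in insert_statements_from_labels
#         if get_text_id(s) is None or get_text_id(s) in used_ids]

This reads each file exactly once no matter how many insert statements there are, so the number of files drops out of the per-ID cost.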

ShadowRanger