I've written a simple Python script that looks up a log file by name in a folder containing roughly 4 million files and reads it. Currently the entire operation takes about 20 seconds on average. I was wondering if there is a way to get the response faster.

Below is my script:

import re
import os
import timeit
from datetime import date

log_path = "D:\\Logs Folder\\"
# Log file names are GUIDs: 8-4-4-4-12 hex characters
rx_file_name = r"[0-9a-z]{8}-[0-9a-z]{4}-[0-9a-z]{4}-[0-9a-z]{4}-[0-9a-z]{12}"
log_search_script = True
today = str(date.today())  # currently unused

while log_search_script:

    try:
        log_search = input("Enter image file name: ")
        # Pull the GUID out of whatever the user pasted
        file_name = re.search(rx_file_name, log_search).group()
        log_file_name = file_name + ".log"
        print(f"\nLooking for log file '{log_file_name}'...\n")
    except AttributeError:  # re.search returned None: no GUID in the input
        print("\n ***** Invalid input. Try again! ***** \n")
        continue

    start = timeit.default_timer()

    if log_file_name in os.listdir(log_path):

        log_file = open(log_path + log_file_name, 'r', encoding="utf8")

        print('\n' + "--------------------------------------------------------" + '\n')
        print(log_file.read())
        log_file.close()
        print('\n' + "--------------------------------------------------------" + '\n')
        print("Time Taken: " + str(timeit.default_timer() - start) + " seconds")
        print('\n' + "--------------------------------------------------------" + '\n')

    else:
        print("Log File Not Found")

    search_again = input('\nDo you want to search for another log ("y" / "n") ?').lower()
    if search_again.startswith('y'):
        print("======================================================\n\n")
    else:
        log_search_script = False

  • To be clear, it's a single flat folder with 4 million files? They aren't in any subfolders? I don't understand the reason to "search" for the file. There can be at most one file with the given name, and there is only one exact folder where you want to check for it; so there is no point in looking at every file name to see if it's present. Instead, just `try` to open the file, and catch and handle the exception when it isn't present. – Karl Knechtel Jan 26 '22 at 09:56
  • Don't use os.listdir at all. It creates a list with 4 million elements and takes time, but more importantly, it's an antipattern. Just try to open the file and handle the error if the file is missing. – buran Jan 26 '22 at 09:57
  • Unless you wanted to look for files where `log_file_name` is only *part of* the file name. However, your code doesn't actually do this. – Karl Knechtel Jan 26 '22 at 09:57
  • @KarlKnechtel yes, it's just one folder with 4 million files and no subfolders. As suggested, I just opened the file without using os.listdir and it worked like a charm. Thanks guys! – Amaal Anoos Jan 26 '22 at 10:16

2 Answers


Your problem is the line:

if log_file_name in os.listdir(log_path):

This has two problems:

  1. os.listdir creates a huge list, which can take a lot of time (and space...).
  2. the ... in ... check then walks that huge list linearly, searching for the file.
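
To see the cost difference, here is a rough sketch (the directory and file name below are placeholders) comparing the listdir scan with asking the OS about a single path:

import os
import timeit

path = "D:\\Logs Folder\\"  # placeholder directory
name = "example.log"        # placeholder file name

# Materializes every directory entry, then searches the list linearly
print(timeit.timeit(lambda: name in os.listdir(path), number=1))

# A single stat call for one path, no directory scan
print(timeit.timeit(lambda: os.path.exists(path + name), number=1))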

Instead, let your OS do the hard work and "ask for forgiveness, not permission". Just assume the file is there and try to open it. If it is not actually there, an error will be raised, which we catch:

try:
    # log_path already ends with a backslash, so no extra separator is needed
    with open(log_path + log_file_name, 'r', encoding="utf8") as log_file:
        print(log_file.read())
except FileNotFoundError:
    print("Log File Not Found")
– Tomerikoo

You can use glob to list the files in the folder:

import glob

# directory_path is the folder to scan, e.g. the log_path from the question
print(glob.glob(directory_path + "*"))
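
That said, when the pattern contains no wildcard characters, glob.glob just checks whether that exact path exists rather than scanning the folder, so a sketch reusing the question's variables could look like:

import glob

# No wildcards here, so glob checks this one path instead of listing the folder
matches = glob.glob(log_path + log_file_name)
if matches:
    with open(matches[0], 'r', encoding="utf8") as log_file:
        print(log_file.read())
else:
    print("Log File Not Found")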
– cnp
  • How will that help? It still scans the folder... There is no need for that – Tomerikoo Jan 26 '22 at 11:18
  • With glob, you get all the files. Afterwards, you can process them in parallel. Try it with https://www.tutorialspoint.com/python/python_multithreading.htm If you have more time, you can work with OpenMP. It can improve the time. – cnp Jan 26 '22 at 12:19
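
For completeness, a minimal sketch of the parallel scan cnp seems to describe, using concurrent.futures (my assumption; the comment names no specific API). As Tomerikoo notes, this still enumerates every file name, so the direct open shown in the first answer remains the faster approach:

import glob
from concurrent.futures import ThreadPoolExecutor

def matches_target(path, target="example.log"):  # hypothetical target name
    # Check one candidate path against the wanted file name
    return path.endswith(target)

all_files = glob.glob(log_path + "*.log")  # still scans ~4 million entries

with ThreadPoolExecutor() as pool:
    hits = [p for p, ok in zip(all_files, pool.map(matches_target, all_files)) if ok]

print(hits or "Log File Not Found")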