I wrote my own version of Python's string.find() method for a computer science class I'm in.

Basically, this program opens a text file, asks the user for the text they want to search for in the file, and outputs the number of the line on which the string occurs, or 'not found' if the string doesn't exist in the file.

However, it takes about 34 seconds to search 250,000 lines of XML.

Where is the bottleneck in my code? I wrote the same program in C# and C++ as well, and those versions run in about 0.3 seconds for 4.5 million lines. I also performed the same search using Python's built-in string.find(), which takes around 4 seconds for 250,000 lines of XML. So I'm trying to understand why my version is so slow. https://github.com/zach323/Python/blob/master/XML_Finder.py

fhand = open('C:\\Users\\User\\filename')
import time
str  = input('Enter string you would like to locate: ') #string to be located in file
start = time.time()
delta_time = 0

def find(str):
    time.sleep(0.01)
    found_str ='' #initialize placeholder for found string
    next_index = 0 #index for comparison checking
    line_count = 1
    for line in fhand: #each line in file
        line_count = line_count +1
        for letter in line: #each letter in line
            if letter == str[next_index]: #compare current letter index to beginning index of string you want to find

                found_str += letter #if a match, concatenate to string placeholder

                #print(found_str) #print for visualization of inline search per iteration
                next_index = next_index + 1


                if found_str == str: #if complete match is found, break out of loop.



                    print('Result is: ', found_str, ' on line %s '%(line_count))
                    print (line)
                    return found_str #return string to function caller
                    break
            else:
                #if a match was found but the next_index match was False, reset the indexes and try again.
                next_index=0 # reset index back to zero
                found_str = '' #reset string back to empty

        if found_str == str:

            print(line)

if str != "":
    result = find(str)
    delta_time = time.time() - start
    print(result)
    print('Seconds elapsed: ', delta_time)  
else:
    print('sorry, empty string')
double-beep
TimmyTooTough
  • You can try regex per line instead of a nested for loop. – gogasca Feb 07 '18 at 20:04
  • I understand I can use other options to perform the same task; however, I am interested in why my version performs roughly 8 times slower than Python's string.find() method. – TimmyTooTough Feb 07 '18 at 20:31
  • 1. Why a `break` immediately after a `return`? 2. ... why all those empty lines? – Jongware Feb 07 '18 at 20:35
  • Because if the string is halfway in the file, why continue searching for it? Makes sense to break early if the program finds the first instance of what you're looking for, versus searching through another million lines. – TimmyTooTough Feb 07 '18 at 20:44
  • Run a profiler, this will help you find out details about where is the delay. https://stackoverflow.com/questions/3927628/how-can-i-profile-python-code-line-by-line – gogasca Feb 08 '18 at 05:12
  • But the program can never reach that `break`. The function does a `return` before it ever gets there. – Jongware Feb 08 '18 at 17:01
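
The profiler suggestion in the comments can be tried with the standard library's `cProfile` and `pstats`. The following is only a minimal sketch: `find_line` and `sample_lines` are hypothetical stand-ins (synthetic data, not the asker's XML file or exact code), and the numbers it prints will differ from machine to machine.

```python
import cProfile
import io
import pstats


def find_line(lines, needle):
    """Naive search that tries every start position in every line.

    Returns the 1-based line number of the first match, or None.
    """
    needle_length = len(needle)
    for line_number, line in enumerate(lines, start=1):
        for start in range(len(line) - needle_length + 1):
            if line[start:start + needle_length] == needle:
                return line_number
    return None


# Synthetic stand-in for the XML file (an assumption, not the real data).
sample_lines = ["<item>filler text</item>\n"] * 20000 + ["<item>needle</item>\n"]

# Profile only the search call.
profiler = cProfile.Profile()
profiler.enable()
result = find_line(sample_lines, "needle")
profiler.disable()

# Collect the five most expensive entries, sorted by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)

print(result)
print(report.getvalue())
```

The report lists call counts and time per function, which shows how much of the run is spent inside the Python-level loop itself.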

2 Answers

Try this:

with open(filename) as f:
    for row in f:
        if string in row:
            print(row)
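
Since the question asks for the line number rather than the line itself, the same idea can be sketched with `enumerate`. This is an illustrative variant, not part of the original answer: `find_first_line`, the UTF-8 encoding, and the temporary demo file are all assumptions.

```python
import os
import tempfile


def find_first_line(path, needle):
    """Return the 1-based number of the first line containing needle, or None."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if needle in line:
                return line_number
    return None


# Demo on a throwaway file (hypothetical content, created only for this example).
with tempfile.NamedTemporaryFile(
    "w", suffix=".xml", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("<a>one</a>\n<b>two</b>\n<c>three</c>\n")
    demo_path = tmp.name

found = find_first_line(demo_path, "two")
missing = find_first_line(demo_path, "not here")
os.remove(demo_path)

print(found)    # 2
print(missing)  # None
```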
Alek Westover
  • I'm interested in why my version is slower than the Python string.find() method. Not necessarily how to find a string within a file. – TimmyTooTough Feb 07 '18 at 20:33

I ran the following code on a text file comparable in size to yours. Your code doesn't run too slowly on my computer.

fhand = open('test3.txt')

import time
string = input('Enter string you would like to locate: ') #string to be located in file
start = time.time()
delta_time = 0


def find(string):
    next_index_to_match = 0 
    sl = len(string)
    ct = 0

    for line in fhand: #each line in file
        ct += 1
        for letter in line: #each letter in line
            if letter == string[next_index_to_match]: #compare current letter index to beginning index of string you want to find
                # print(line)
                next_index_to_match += 1

                if sl == next_index_to_match: #if complete match is found, break out of loop.
                    print('Result is: ', string, ' on line %s '%(ct))
                    print (line)
                    return True

            else:
                #if a match was found but the next_index match was False, reset the indexes and try again.
                next_index_to_match=0 # reset index back to zero
    return False

if string != "":   
    find(string)
    delta_time = time.time() - start
    print('Seconds elapsed: ', delta_time)  
else:
    print('sorry, empty string')
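
One thing worth checking in a hand-rolled matcher like this: resetting the index to zero on a mismatch skips the letter that caused the mismatch, so searching for `ab` in `aab` reports no match. The sketch below mirrors that logic in `reset_on_mismatch_find` and compares it with a search that tries every start position (`naive_find`) and with the built-in `str.find`. All function names and the sample data are made up for illustration, and the timings it prints depend on the machine.

```python
import time


def reset_on_mismatch_find(line, needle):
    """Mirror of the reset-to-zero logic; True if it reports a match."""
    next_index = 0
    for letter in line:
        if letter == needle[next_index]:
            next_index += 1
            if next_index == len(needle):
                return True
        else:
            # The mismatching letter is not re-tested against needle[0].
            next_index = 0
    return False


def naive_find(line, needle):
    """Try every start position; return the match index or -1, like str.find."""
    for start in range(len(line) - len(needle) + 1):
        offset = 0
        while offset < len(needle) and line[start + offset] == needle[offset]:
            offset += 1
        if offset == len(needle):
            return start
    return -1


print(reset_on_mismatch_find("aab", "ab"))  # False: the match is missed
print(naive_find("aab", "ab"))              # 1
print("aab".find("ab"))                     # 1

# Rough timing on synthetic lines (an assumption, not the asker's XML file).
lines = ["<item>filler text</item>"] * 20000 + ["<item>needle</item>"]

started = time.perf_counter()
naive_hits = [n for n, line in enumerate(lines, 1) if naive_find(line, "needle") != -1]
naive_seconds = time.perf_counter() - started

started = time.perf_counter()
builtin_hits = [n for n, line in enumerate(lines, 1) if line.find("needle") != -1]
builtin_seconds = time.perf_counter() - started

print(naive_hits, builtin_hits)
print("naive:", naive_seconds, "built-in:", builtin_seconds)
```

Both searches should agree on which line matches; the difference in elapsed time comes from running the inner loop in interpreted Python rather than inside the built-in method.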
Alek Westover