
I have written code that reads a file, finds the line containing the word table_begin, and then counts the number of lines until the line containing the word table_end.

Here is my code:

for line in read_file:
    if "table_begin" in line:
        k=read_file.index(line)
    if 'table_end' in line:
        k1=read_file.index(line)
        break

count=k1-k
if count<10:
    q.write(file)

I have to run it on ~15K files and it's a bit slow (~1 file/sec), so I was wondering if I am doing something inefficient. I was not able to find the problem myself, so any help would be great!

3 Answers


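When you call read_file.index(line), you scan the list of lines from the beginning just to get the index of the line you're already on. Repeating that search is likely what's slowing you down, and it returns the position of the first equal line, which is not necessarily the current one if a line is repeated. Instead, use enumerate() to keep track of the line number as you go: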

for i, line in enumerate(read_file):
    if "table_begin" in line:
        k = i
    if "table_end" in line:
        k1 = i
        break
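
To finish the calculation from the question, the two indices can be subtracted once the loop ends. The following is a minimal sketch, assuming read_file is a list of lines; the helper name count_table_lines and the sample data are hypothetical, and returning None for a missing marker is an added choice that the question does not specify.

```python
def count_table_lines(lines):
    """Return the line distance from table_begin to table_end, or None."""
    k = k1 = None
    for i, line in enumerate(lines):
        if "table_begin" in line:
            k = i
        if "table_end" in line:
            k1 = i
            break
    if k is None or k1 is None:
        return None  # a marker is missing, so there is nothing to count
    return k1 - k


# Hypothetical sample data standing in for one file's lines.
sample = ["header\n", "table_begin\n", "a\n", "b\n", "table_end\n", "footer\n"]
print(count_table_lines(sample))  # prints 3
```

The None check avoids the NameError that the original loop would raise when a file lacks one of the markers.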
Community
Claudiu

You are always checking for both strings on every line. In addition, index is expensive because it searches the whole list of lines rather than just the current line. Using "in" or "find" on the line is quicker, as is checking only for table_begin until you've found it and only for table_end after that. If you aren't certain that each file contains table_begin and table_end in that order (and only one of each), you may need some extra checks here (perhaps pairing each begin/end into a tuple).

EDIT: Incorporated enumerate and switched from a while to a for loop, allowing some complexity to be removed.

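def find_lines(filename):
    bookends = ["table_begin", "table_end"]
    # Read all lines up front; the with block closes the file afterwards.
    with open(filename) as f:
        lines = f.readlines()
    for bookend in bookends:
        # Yield the index of the first line containing each marker.
        for ind, line in enumerate(lines):
            if bookend in line:
                yield ind
                break

for ind in find_lines(r"myfile.txt"):
    print(ind)
print("done")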
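
The idea of looking for table_end only after table_begin has been seen can also be done in a single pass by sharing one iterator between the two searches. This is a rough sketch rather than the answer's own code: the name find_bookends is hypothetical, and it assumes the end marker sits on a later line than the begin marker.

```python
def find_bookends(lines, bookends=("table_begin", "table_end")):
    """Yield the index of each marker in order, scanning the lines only once."""
    numbered = enumerate(lines)  # one shared iterator, so each search resumes
    for bookend in bookends:
        for ind, line in numbered:
            if bookend in line:
                yield ind
                break
        else:
            return  # this marker never appeared, so stop early


# Hypothetical sample: a stray table_end before the real table is ignored.
sample = ["table_end\n", "table_begin\n", "row\n", "table_end\n"]
print(list(find_bookends(sample)))  # prints [1, 3]
```

Because the inner loop continues from where the previous search stopped, a table_end that appears before table_begin is skipped, and a missing marker simply produces a shorter result.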

Presumably you obtain read_file from f.readlines(), which is a bad idea because it reads the whole file into memory.

You can save a lot of time by:

  • reading the file line by line;
  • searching for one keyword at a time;
  • stopping after 10 lines.

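    with open('test.txt') as read_file:
        counter = 0
        # Skip ahead to the line containing table_begin.
        for line in read_file:
            if "table_begin" in line:
                break
        # Continue from the next line; stop at table_end or after 10 lines.
        # Assumes table_begin appears before table_end.
        for line in read_file:
            counter += 1
            if "table_end" in line or counter >= 10:
                break
        if counter < 10:
            q.write(file)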
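
For a run over ~15K files, the same streaming idea can be wrapped in a function that reports whether a file qualifies. The following is only a sketch under assumptions: the names has_short_table and limit are hypothetical, and treating a file with a missing marker as not qualifying is a choice the question does not spell out.

```python
def has_short_table(path, limit=10):
    """Return True if table_end follows table_begin by fewer than limit lines."""
    with open(path) as f:
        for line in f:
            if "table_begin" in line:
                break
        else:
            return False  # the file has no table_begin
        # Keep reading the same file object, counting lines after table_begin.
        for count, line in enumerate(f, start=1):
            if "table_end" in line:
                return count < limit
            if count >= limit:
                return False  # table is too long, so stop reading early
    return False  # reached the end of the file without table_end
```

A caller could then loop over the file names and write out only those for which has_short_table returns True, so at most the first few lines after table_begin are read from each file.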
    
B. M.