
I'm having some trouble optimizing this piece of code. It works, but it seems unnecessarily slow. The function searches for searchString in a file, starting at line line_nr, and returns the line number of the first hit.

import linecache

def searchStr(fileName, searchString, line_nr = 1, linesInFile = 0):
    # searchString is the string to look for.
    # line_nr is the line to start searching from (needed to search after certain lines).
    # linesInFile is the total number of lines in the file.
    while line_nr < linesInFile + 1:
        line = linecache.getline(fileName, line_nr)
        has_match = line.find(searchString)
        if has_match >= 0:
            return line_nr
        line_nr += 1

I've tried something along these lines, but never managed to implement the "start at a certain line number" input.

Edit: The use case: I'm post-processing analysis files containing text and numbers that are split into different sections with headers. The headers found at line_nr are used to break out chunks of the data for further processing.

Example of call:

startOnLine = searchStr(fileName, 'Header 1', 1, 10000000)
endOnLine = searchStr(fileName, 'Header 2', startOnLine, 10000000)
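
For illustration, a hedged sketch of how the two returned line numbers might be used to pull out one section; read_section and the file name are hypothetical, not part of the original code, and line numbers are 1-based to match searchStr above (it assumes both headers are actually found).

import itertools

def read_section(fileName, startOnLine, endOnLine):
    # Hypothetical helper: return everything from the 'Header 1' line
    # up to (but not including) the 'Header 2' line.
    with open(fileName) as infile:
        return list(itertools.islice(infile, startOnLine - 1, endOnLine - 1))

fileName = 'analysis.out'  # placeholder file name
startOnLine = searchStr(fileName, 'Header 1', 1, 10000000)
endOnLine = searchStr(fileName, 'Header 2', startOnLine, 10000000)
section = read_section(fileName, startOnLine, endOnLine)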

user2987193
  • What is "linecache.getline()" ? – bruno desthuilliers Nov 13 '13 at 10:37
  • Sorry, forgot to mention `import linecache` (random access to text lines) – user2987193 Nov 13 '13 at 10:38
  • Try using a regexp, it's quite a bit faster than `str.find()` – yedpodtrzitko Nov 13 '13 at 10:41
  • You should explain the use case... what are you trying to accomplish? – root Nov 13 '13 at 10:44
  • @yedpodtrzitko: measuring both `str.find` and `re.search` with `timeit`, I get 0.7773511409759521 for `re.search` and 0.15282893180847168 for `str.find` (Python 2.7.3 on ubuntu). – bruno desthuilliers Nov 13 '13 at 10:47
  • @brunodesthuilliers can you show me the code and testing data? – yedpodtrzitko Nov 13 '13 at 10:55
  • @yedpodtrzitko: http://pastie.org/8477197 – bruno desthuilliers Nov 13 '13 at 12:41
  • Sorry, but that's a really silly example. Of course `str.find` will be much faster on a single line, but in the case where you have hundreds to thousands of lines, regex will be faster (+ using `re.compile`...) – yedpodtrzitko Nov 13 '13 at 12:47
  • @yedpodtrzitko: please re-read the question - we're talking about searching on a single line and returning the line number ;). Also, the `re` module does compile patterns automagically and keeps a (rather large) cache of the last used compiled patterns. Since the pattern is dynamic and only known at runtime, I don't see the point of manually calling `re.compile` here. – bruno desthuilliers Nov 13 '13 at 14:37
  • @brunodesthuilliers I can't agree about that 'single line'. You're looking for single line, yes. But you're looking for that single line in a whole file and that means you're looking thru a lot of lines. I tried to read it over and over again, but this is my final conclusion .) Also about `re.compile`: see this gist: https://gist.github.com/yedpodtrzitko/7451361 – yedpodtrzitko Nov 13 '13 at 15:54
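
For reference, a rough sketch of the kind of single-line timing comparison discussed in the comments above; the sample line and pattern are made up, and the absolute numbers will differ by Python version and machine.

import re
import timeit

line = "Some analysis output line containing Header 1 somewhere"  # made-up sample
pattern = "Header 1"

print("str.find: ", timeit.timeit(lambda: line.find(pattern), number=1000000))
print("re.search:", timeit.timeit(lambda: re.search(pattern, line), number=1000000))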

2 Answers


Why don't you start with the simplest possible implementation?

def search_file(filename, target, start_at = 0):
    # Scan the file once; line numbers here are 0-based.
    with open(filename) as infile:
        for line_no, line in enumerate(infile):
            # Skip lines before the requested starting point.
            if line_no < start_at:
                continue
            if line.find(target) >= 0:
                return line_no
    return None
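
A hypothetical usage mirroring the call in the question (the file name is a placeholder); passing the previous hit plus one as start_at avoids scanning the same lines twice. Note that line numbers here are 0-based and that this assumes both headers are found.

start_on_line = search_file('analysis.out', 'Header 1')
end_on_line = search_file('analysis.out', 'Header 2', start_at=start_on_line + 1)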
bruno desthuilliers
  • I'm so new to Python that I don't know what the simplest way is :) But this seems much simpler and is slightly faster. Thank you! – user2987193 Nov 13 '13 at 11:02
  • Should have said much faster! Time went from 11.597 to 4.341 – user2987193 Nov 13 '13 at 11:13
  • @user2987193: feel free to accept my answer if it solves your problem ;). More seriously: depending on your concrete use case (file format if any, why "start from a given line", what is all this used for and in which context) there might be quite a few better solutions. – bruno desthuilliers Nov 13 '13 at 12:44
  • FEM-analysis post-processing. The start at a certain line is to avoid reading the same data twice. Much of the output data varies from time to time. – user2987193 Nov 13 '13 at 13:25
  • Ok I don't know zilch about FEM so I can't help more, but possibly someone will chime in... – bruno desthuilliers Nov 13 '13 at 14:59

I guess your file is something like:

Header1 data11 data12 data13 ..
name1 value1 value2 value3 ...
...
Header2 data21 data22 data23 ..
nameN valueN1 valueN2 valueN3 ..
...

Does the 'Header' string have any constant format (e.g. do all headers start with '#' or something similar)? If so, you can read the file line by line, check whether each line matches that format (e.g. if line[0] == '#') and write different code for the different kinds of lines (definition lines and data lines in your example).

Record class:

class Record:
    def __init__(self):
        self.data = {}
        self.header = {}

    def set_header(self, line):
        ...

    def add_data(self, line):
        ...
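
The method bodies are left as ... in the answer; below is a minimal sketch of what they could look like, assuming (purely as an illustration) that header lines look like "#Header1 data11 data12 ..." and that data lines are whitespace-separated values keyed by their first field. Neither assumption comes from the question.

class Record:
    def __init__(self):
        self.data = {}
        self.header = {}

    def set_header(self, line):
        # Assumed format: "#Header1 data11 data12 ..." (hypothetical).
        parts = line.lstrip("#").split()
        self.header = {"name": parts[0], "fields": parts[1:]}

    def add_data(self, line):
        # Assumed format: "name1 value1 value2 ..." (hypothetical).
        parts = line.split()
        self.data[parts[0]] = parts[1:]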

iterate part:

def parse(p_file):
    record = None
    for line in p_file:
        if line[0] == "#":
            # A header line starts a new record; yield the finished one first.
            if record:
                yield record
            record = Record()
            record.set_header(line)
        else:
            record.add_data(line)
    if record:
        yield record

main func:

data_file = open(...)
for rec in parse(data_file):
    ...
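
Putting the pieces together on a small in-memory sample (the contents below are invented just to show the flow; parse only needs an iterable of lines, so a real file object works the same way):

import io

sample = io.StringIO(
    "#Header1 data11 data12\n"
    "name1 1.0 2.0\n"
    "name2 3.0 4.0\n"
    "#Header2 data21 data22\n"
    "name3 5.0 6.0\n"
)

for rec in parse(sample):
    print(rec.header, rec.data)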
Luyi Tian
  • Yes, it's similar to that; the only problem is that it repeats Header1 several times in different contexts. This function is used several times in different contexts. – user2987193 Nov 13 '13 at 13:24
  • I have updated my answer, hope it helps. You could create a Record class and use "yield" to return a record. – Luyi Tian Nov 15 '13 at 17:14