17

I can't rely on specific line numbers because I'm reading multiple files that share the same format but vary in length.
Say I have this text file:

Something here...  
... ... ...   
Start                      #I want this block of text 
a b c d e f g  
h i j k l m n  
End                        #until this line of the file
something here...  
... ... ...  

I hope you know what I mean. I was thinking of iterating through the file, searching with a regular expression to find the line numbers of "Start" and "End", and then using linecache to read from the Start line to the End line. But how do I get the line numbers? What function can I use?

BPm
  • This question is very similar to this one http://stackoverflow.com/questions/7098530/repeatedly-extract-a-line-between-two-delimiters-in-a-text-file-python – salomonvh Dec 02 '15 at 14:25
  • It's also similar to this one https://stackoverflow.com/a/9222120/2641825 which has a nice answer using a regular expression. The call would be `re.findall(r'Start(.*?)End',data,re.DOTALL)` similar to @pyInTheSky's answer below. – Paul Rougieux Feb 07 '19 at 17:57

4 Answers

37

If you simply want the block of text between Start and End, you can do something simple like:

with open('test.txt') as input_data:
    # Skips text before the beginning of the interesting block:
    for line in input_data:
        if line.strip() == 'Start':  # Or whatever test is needed
            break
    # Reads text until the end of the block:
    for line in input_data:  # This keeps reading the file where the first loop stopped
        if line.strip() == 'End':
            break
        print(line, end='')  # Line is extracted (or block_of_lines.append(line), etc.)

In fact, you do not need to manipulate line numbers in order to read the data between the Start and End markers.

The logic ("read until…") is repeated in both blocks, but it is quite clear and efficient (other methods typically involve checking some state [before block/within block/end of block reached], which incurs a time penalty).

Eric O. Lebigot
  • Which means after the break statement, the next for loop reads the lines from where the first for loop left the reading. – jax Nov 01 '18 at 16:03
  • What about multiple occurrences of blocks with the same opening and closing text? – Heinz Mar 28 '20 at 17:23
  • 1
    That's a good question. It's not as simple as adding a loop inside the `with` statement: the difficulty is to stop iterating when the file is fully read, while combining this with the marker detection logic. This deserves a separate question. – Eric O. Lebigot Mar 29 '20 at 18:49
  • Shouldn't the first instance of 'break' not be there? – gannex Oct 01 '21 at 23:53
  • What else do you propose? Without a break, the first loop would read the whole file instead, and nothing would be printed. – Eric O. Lebigot Oct 03 '21 at 15:48
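For the multiple-block case raised in the comments above, one possible approach is a small generator that tracks whether it is currently inside a block. This is only a sketch, not the answer author's method: the name `iter_blocks` and its `start`/`end` parameters are made up for illustration, and it uses a state-tracking loop rather than the two consecutive loops shown in the answer. It assumes the markers sit on lines by themselves and silently drops a block that is never closed.

```python
def iter_blocks(lines, start="Start", end="End"):
    """Yield each block found between a start and an end marker line.

    Each block is a list of lines with the trailing newline removed.
    """
    block = None  # None means "currently outside any block"
    for line in lines:
        stripped = line.strip()
        if block is None:
            if stripped == start:
                block = []  # A new block begins after this marker
        elif stripped == end:
            yield block  # The block is complete
            block = None
        else:
            block.append(line.rstrip("\n"))


# Illustrative data; a file object opened with open() could be passed instead.
sample = """Something here...
Start
a b c d e f g
h i j k l m n
End
something here...
Start
o p q
End
"""

for found_block in iter_blocks(sample.splitlines()):
    print(found_block)
```

Because the generator only consumes lines lazily, it can be given an open file directly, and iteration stops on its own when the file is exhausted.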
5

Here's something that will work:

# Accumulates the block, markers included, using a "found" flag
block = ""
found = False

with open("test.txt") as data_file:
    for line in data_file:
        if found:
            block += line
            if line.strip() == "End":
                break
        elif line.strip() == "Start":
            found = True
            block = line  # Keeps the marker line, newline included
Eric O. Lebigot
orlp
  • 4
    @BPm: This is an example of a "finite state machine" (http://en.wikipedia.org/wiki/Finite_state_machine) : the machine starts in a state "Block not yet found" (found==False), keeps running in a state "Within the block" (found==True) and in this case stops when "End" is found. They can be a little inefficient (here, `found` has to be checked for each line in the block), but state machines often allow one to cleanly express the logic of more complex algorithms. – Eric O. Lebigot Sep 27 '11 at 08:35
  • +1, because this is a good example of the completely valid state machine approach. – Eric O. Lebigot Sep 27 '11 at 08:40
  • 1
    Thanks for the "finite state machine" reference! – BPm Nov 30 '12 at 23:54
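To make the "finite state machine" reading in the comments above more concrete, here is a minimal sketch with the states spelled out explicitly. The `State` enum and the `extract_block` function are hypothetical names chosen for this illustration; the answer itself encodes the same idea with a single boolean `found`.

```python
from enum import Enum


class State(Enum):
    BEFORE = 1  # "Block not yet found"
    INSIDE = 2  # "Within the block"
    DONE = 3    # "End" marker has been seen


def extract_block(lines):
    """Return (block_lines, final_state) for the first Start/End block."""
    state = State.BEFORE
    block = []
    for line in lines:
        text = line.strip()
        if state is State.BEFORE:
            if text == "Start":
                state = State.INSIDE
        elif state is State.INSIDE:
            if text == "End":
                state = State.DONE
                break  # Nothing after the block is read
            block.append(text)
    return block, state


lines = ["Something here...", "Start", "a b c d e f g", "h i j k l m n", "End", "tail"]
block, state = extract_block(lines)
print(block, state)
```

Returning the final state lets the caller distinguish "no Start marker" (`State.BEFORE`) from "Start without End" (`State.INSIDE`), which the boolean version cannot report directly.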
3

You can use a regex pretty easily. Below is a simple example; you can make it more robust as needed.

>>> import re
>>> START = "some"
>>> END = "Hello"
>>> test = "this is some\nsample text\nthat has the\nwords Hello World\n"
>>> m = re.compile(r'%s.*?%s' % (START,END), re.S)
>>> m.search(test).group(0)
'some\nsample text\nthat has the\nwords Hello'
Eric O. Lebigot
pyInTheSky
  • 1
    +1: Very good idea: this is compact, and might be very efficient, since the `re` module is fast. The START and END tags should be forced to be on a line *by themselves*, though, in your regular expression (`^…$`). – Eric O. Lebigot Sep 27 '11 at 08:37
  • Thanks : ) .. I don't think you can use ^ || $ when you use the re.S spec since it includes newline chars, think you'd need to explicitly say '%s\n.*?%s\n' – pyInTheSky Sep 27 '11 at 14:02
  • 1
    You can certainly use ^…$ in this case, by simply adding the re.MULTILINE flag (http://docs.python.org/dev/library/re.html#module-contents). – Eric O. Lebigot Sep 27 '11 at 14:41
  • you are correct. for some reason, I had thought that the .S conflicted with the .M when using ^/$ but it does not, so thank you for the comment – pyInTheSky Sep 27 '11 at 15:14
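The point settled in the comments above, that `^` and `$` can be combined with `re.S` by also passing `re.MULTILINE`, might look like the sketch below. The sample text and the variable names are assumptions made for illustration, with "Start" and "End" taken from the question as the markers.

```python
import re

# Illustrative text; the first line starts with "Start" but is not a marker line.
text = (
    "Start of something...\n"
    "Start\n"
    "a b c d e f g\n"
    "h i j k l m n\n"
    "End\n"
    "something here...\n"
)

# re.MULTILINE makes ^ and $ match at line boundaries;
# re.DOTALL (alias re.S) lets . match newlines as well.
pattern = re.compile(r"^Start[ \t]*\n(.*?)^End[ \t]*$", re.MULTILINE | re.DOTALL)

match = pattern.search(text)
block = match.group(1) if match else None
print(repr(block))
```

Requiring a newline right after the start marker and `$` right after the end marker keeps lines that merely begin with the marker text from being treated as delimiters.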
1

This should be a start for you:

started = False
collected_lines = []
path = "test.txt"  # Assumed file name, as in the other answers
with open(path, "r") as fp:
    for i, line in enumerate(fp):
        if line.rstrip() == "Start":
            started = True
            print("started at line", i)  # counts from zero!
            continue
        if started and line.rstrip() == "End":
            print("end at line", i)
            break
        if started:
            # process line (only lines inside the block are collected)
            collected_lines.append(line.rstrip())

enumerate takes any iterable and yields (index, item) pairs. E.g.:

  print(list(enumerate("a b c".split())))

prints

   [(0, 'a'), (1, 'b'), (2, 'c')]

UPDATE:

The poster asked about using a regex to match lines like "===" and "======":

import re
print(re.match("^=+$", "===") is not None)     # True
print(re.match("^=+$", "======") is not None)  # True
print(re.match("^=+$", "=") is not None)       # True
print(re.match("^=+$", "=abc") is not None)    # False
print(re.match("^=+$", "abc=") is not None)    # False
rocksportrocker