0

I have a file, like this:

<prop type="ltattr-match">1-1</prop>
id =>3</prop>
<tuv xml:lang="en">
<seg> He is not a good man </seg>

What I want is to detect the third line before the line `He is not a good man`, i.e. (id =>3). The file is big. What can I do?

DSM
sss

4 Answers

2

I suggest using a double ended queue with a maximum length: this way, only the required amount of "backlog" is stored and you don't have to fiddle around with slices manually. We don't need the "double-ended-ness", but the normal Queue class blocks if the queue is full.

import collections
dq = collections.deque([], 3)        # empty queue holding at most 3 lines
matches = []

with open("mybigfile.txt") as file:
    for line in file:                # iterate lazily; readlines() would load the whole file
        if line.startswith('<seg>') and len(dq) == 3:
            matches.append(dq[0])    # oldest stored line = third line before this one
        dq.append(line)              # save the line, if already 3 lines stored,
                                     # discard oldest line.
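
The same idea can be wrapped in a reusable generator. This is only a sketch: the name `nth_line_before` and its parameters are made up for illustration, and `io.StringIO` stands in for the real file, using the sample lines from the question.

```python
import collections
import io

def nth_line_before(lines, is_match, n=3):
    """Yield the line found n lines before each line for which is_match is true."""
    window = collections.deque(maxlen=n)   # keeps only the n most recent lines
    for line in lines:
        # Only report a hit once n earlier lines have actually been seen.
        if is_match(line) and len(window) == n:
            yield window[0]                # oldest stored line = n lines back
        window.append(line)                # a full deque drops its oldest entry

# Sample data copied from the question; a real file object works the same way.
sample = io.StringIO(
    '<prop type="ltattr-match">1-1</prop>\n'
    'id =>3</prop>\n'
    '<tuv xml:lang="en">\n'
    '<seg> He is not a good man </seg>\n'
)
found = list(nth_line_before(sample, lambda l: l.startswith('<seg>')))
print(found)
```

With this sample, `n=3` picks up the `<prop ...>` line; passing `n=2` would give `id =>3</prop>` instead, so choose `n` according to which line is really wanted.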
Jasper
1

Read each line in sequence, remembering only the last 3 read at any point.

Something like:

# Assume f is a file object open to your file
last3 = []
last3.append( f.readline() )
last3.append( f.readline() )
last3.append( f.readline() )
while True:
    line = f.readline()
    # '' means end of file; the startswith test stands in for your condition
    if not line or line.startswith('<seg>'):
        break
    last3 = last3[1:]+[line]
# If line is non-empty here, last3[0] is 3 lines before the matching line

You'll need to modify this to handle files with fewer than 3 lines, and to check `line` after the loop to tell a match from reaching the end of the file.
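
One possible way to cover those edge cases is sketched below. The function name `third_line_before` and the `is_match` callback are hypothetical, and returning `None` for "no match" or "match too close to the top of the file" is just one choice among several.

```python
import io

def third_line_before(f, is_match):
    """Return the line 3 lines before the first matching line, or None."""
    last3 = []
    # iter(f.readline, '') stops when readline() returns '' at end of file.
    for line in iter(f.readline, ''):
        if is_match(line):
            # Fewer than 3 lines precede the match: nothing to report.
            return last3[0] if len(last3) == 3 else None
        last3 = last3[-2:] + [line]        # keep at most the last 3 lines
    return None                            # reached end of file without a match

# Usage with an in-memory stand-in for the file
demo = io.StringIO("a\nb\nc\n<seg> target </seg>\n")
result = third_line_before(demo, lambda l: l.startswith('<seg>'))
print(repr(result))
```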

Scott Hunter
1
with open("mybigfile.txt") as file:
    lines = file.readlines()

for idx, line in enumerate(lines):
    # idx >= 3 guards against a negative index wrapping around to the end of the list
    if line.startswith("<seg>") and idx >= 3:
        line_to_detect = lines[idx-3]
        #use idx-2 if you want the _second_ line before this one, 
        #ex `id =>3</prop>`
        print("This line was detected:")
        print(line_to_detect)

Result:

This line was detected:
<prop type="ltattr-match">1-1</prop>

As we previously discussed in chat, this method can be memory intensive for very large files. But 100 pages isn't very large, so this should be fine.

Community
Kevin
0
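file = "path/to/the/file"
with open(file, "r") as f:
    lines = f.readlines()
for i, line in enumerate(lines):
    # i >= 1 makes sure there is a previous line to print
    if "<seg> He is not a good man </seg>" in line and i >= 1:
        print(lines[i-1])  # Print the previous line

If you need the second line before, just change it to print(lines[i-2]) and the guard to i >= 2; for the third line before, use lines[i-3] with i >= 3.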

llrs
  • This will have `line` be empty. It certainly won't be the third line, much less "the third line before the line He is not a good man". – DSM Apr 25 '14 at 15:46
  • And OP isn't looking for the 3rd line, but the 3rd line BEFORE one with specific content. – Scott Hunter Apr 25 '14 at 15:47
  • @DSM I thought that you don't need to use the variable you loop over. Now it will check the next line and find the match even if it happens more than once. – llrs Apr 25 '14 at 15:56