I am quite new to Python and only have piecewise, cookie-cutter knowledge gleaned from numerous web pages.

That being said, I am trying to search through a file (~10k lines) for lines matching a set of 'filter'-like criteria I wrote, and then print each matching line AND the line that sits X lines before it.

I have created the following script to open the file, iterate over it line by line, and print any line that meets the filter criteria to an output file; however, I am stumped on how to incorporate the look-back into the current script.

import os

output_file = 'Output.txt'
filename = 'BigFile.txt'                 

numLines = 0
numWords = 0
numChrs = 0
numMes = 0

f1 = open(output_file, 'w')
print 'Output File has been Opened'

with open(filename, 'r') as file:
   for line in file:
      wordsList = line.split()
      numLines += 1
      numWords += len(wordsList)
      numChrs += len(line)

      if "X" in line and "Y" not in line and "Z" in line:
          numMes += 1
          print >>f1, line
          print 'Object found and Catalogued in Output.txt'                          

print "Lines: %i\nWords: %i\nCharacters: %i" % (numLines, numWords, numChrs)
print >>f1, "Lines: %i\nWords: %i\nCharacters: %i" % (numLines, numWords, numChrs)

print "There are a total of %i thing in this file" % (numMes)
print >>f1, "There are a total of %i things in this file" % (numMes)

f1.close()

print 'Output File has been Closed'

My first guess was using something like enumerate(), but I don't think I can just write lines - 5 to print the line that is 5 before the current one:

lines = f1.enumeration()
if "blah blah" in line and "so so" not in line:
    print >>f1, lines
    print >>f1, [lines - 5]
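If the whole file fits comfortably in memory, that kind of index arithmetic does become possible once every line is in a list. A rough sketch of the idea (untested against my actual file, and reusing the f1 handle from above):

with open(filename, 'r') as f:
    all_lines = f.readlines()   # load every line into a list

for i, line in enumerate(all_lines):
    if "blah blah" in line and "so so" not in line:
        if i >= 5:
            print >>f1, all_lines[i - 5]   # the line 5 before the match
        print >>f1, line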

The best part is yet to come, though, because I then have to take the Output.txt file and compare it with another file to output the criteria that match in both files... but one step at a time, right?

Also, feel free to add in blurbs of 'proper' technique... I'm sure this script can be written in a better way, so please do educate me on anything I am doing wrong.

Thanks in advance for any help!


UPDATE: Have successfully implemented the fix thanks to the help below:

import os

output_file = 'Output.txt'
filename = 'BigFile.txt'                 

numLines = 0
numWords = 0
numChrs = 0

numMulMes = 0

last5 = []

f1 = open(output_file, 'w')
print 'Output File has been Opened'

with open(filename, 'r') as file:
    for line in file:
        wordsList = line.split()
        numLines += 1
        numWords += len(wordsList)
        numChrs += len(line)
        last5[:] = last5[-5:]+[line]   # keep the previous 5 lines plus the current one (6 total)
        if "X" in line and "Y" not in line and "Z" not in line:
            del last5[1:5]   # drop the middle lines, keeping only the 5th-previous and the match: the missing piece of the puzzle!
            numMulMes += 1
            print >>f1, last5
            print 'Object found and Catalogued in Output.txt'

print "Lines: %i\nWords: %i\nCharacters: %i" % (numLines, numWords, numChrs)
print >>f1, "Lines: %i\nWords: %i\nCharacters: %i" % (numLines, numWords, numChrs)

print "There are a total of %i messages in this file" % (numMulMes)
print >>f1, "There are a total of %i messages in this file" % (numMulMes)

f1.close()

print 'Output File has been Closed'

I kept trying to modify the output file via another separate script, and for the longest time I was fighting str-vs-list operation errors. I just decided to come back to the original script and throw it in there on a whim, and voilà.
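For anyone reading later: the leftover cleanup (the stray brackets and quotes that come from printing the raw list) goes away if the two buffered lines are printed individually, as Patrick suggests in the comments on his answer below. A sketch of just the if-block, assuming the same last5 buffer as above:

        if "X" in line and "Y" not in line and "Z" not in line:
            numMulMes += 1
            # printing the lines individually avoids the list repr's brackets and quotes;
            # rstrip() drops the trailing newlines
            print >>f1, last5[0].rstrip()
            print >>f1, line.rstrip()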

Thanks for pushing me in the right direction, it was easy to figure out from there!

Symbal

4 Answers

You solved most of the stuff yourself (counting words, lines, line numbers, etc.). You can simply remember the last n lines while going through your file.

Example:

t = """"zero line
one line
two line
three line
four line 
five line 
six line
seven line 
eight line
""" 

last5 = [] # memory cell
for l in t.split("\n"):  # similar to your for line in file: 
    last5[:] = last5[-4:]+[l] # keep last 4 and add current line, inplace list mod 

    if "six" in l:
        print last5

You can also look at collections.deque and specify a max length (you need to import it):

from collections import deque

last5 = deque(maxlen=5)
for l in t.split("\n"): 
    last5.append(l) # will automatically only keep 5 (maxlen)

    if "six" in l:
        print last5

Output:

 # list version
 ['two line', 'three line', 'four line ', 'five line ', 'six line'] 

 # deque version
 deque(['two line', 'three line', 'four line ', 'five line ', 'six line'], maxlen=5) 
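To connect this back to the question, the same rolling buffer drops straight into the original file-reading loop. A minimal sketch, assuming the question's filenames and filter condition:

last5 = []
with open('BigFile.txt', 'r') as src, open('Output.txt', 'w') as out:
    for line in src:
        last5[:] = last5[-4:] + [line]   # rolling window: at most the last 5 lines
        if "X" in line and "Y" not in line and "Z" in line:
            print >>out, last5[0],   # oldest buffered line (trailing comma: it already ends in \n)
            print >>out, line,       # the matching line itself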
Patrick Artner
  • Adding in a link to [Python slice overview](https://stackoverflow.com/questions/509211/understanding-pythons-slice-notation) for above! – Symbal Apr 25 '18 at 00:22
  • Hey Patrick, I've hit a little bit of a snag... So I successfully tweaked and implemented the line recall script you supplied above! However, as expected from viewing the slice overview, the `[-4:]+[l]` bit takes ALL five lines prior... I only need the 5th line! And when I remove the : to stop the call for everything after the 5th line back, `[-4]+[l]`, I get an error "IndexError: list index out of range" – Symbal Apr 25 '18 at 00:47
  • Figured it out! Thanks for your help. – Symbal Apr 25 '18 at 02:42
  • 4h ago was 3 am - fast asleep :) `if "six" in l: print last5[0]` +`print last5[-1]` prints the first line of the buffer, and the just recently added one :) – Patrick Artner Apr 25 '18 at 05:10
  • Oh! I needed to separate the print statements... I was wondering why I couldn't get it to work without the colon, either way, I got the output... now just need to work on cleaning it up since it appended and prepended some characters. Off to another Stack! Thanks again for the help Patrick! – Symbal Apr 25 '18 at 21:34

Here is the same solution @PatrickArtner suggested, but with a ring buffer. It may (or may not, I didn't check) work faster with big files. The idea is quite simple: create a list of the required size (the number of lines you want to keep) and a counter cnt for the current recording position. For each new line, increase cnt by one and take it modulo the size of the buffer, so cnt loops around inside the list. For example, if the list size is 5, cnt = (cnt+1)%5 gives 0 1 2 3 4 0 1 2 and so on. At each step, cnt points to the oldest data in the list, which gets overwritten by the new data. An example of this approach is below.

t = """"zero line
six line - surprize 
one line
two line
three line
four line 
five line 
six line
seven line 
eight line
""" 


last5 = [None, None, None, None, None]  # ring buffer holding the last 5 lines
cnt = 0                                 # index of the slot to overwrite next
for l in t.split("\n"):
    last5[cnt] = l                      # overwrite the oldest entry with the current line
    if 'six' in l:
        # walk the buffer from oldest to newest
        print last5[(cnt+1)%5]
        print last5[(cnt+2)%5]
        print last5[(cnt+3)%5]
        print last5[(cnt+4)%5]
        print last5[(cnt+0)%5]
        print
    cnt = (cnt+1)%5                     # advance, wrapping around the buffer

The output is quite simple:

None
None
None
"zero line
six line - surprize 

two line
three line
four line 
five line 
six line

NOTE: If you read from a file, the file is quite big, the strings you need to keep are huge (for example, gene sequences), and your condition doesn't trigger often, be clever: do not keep the strings in memory. Create a list of the positions in the file where the last few lines start and reread them if you need to. Below is an example of how to make it very fast...

from numpy import random as rnd

print "Creating the file ...."
DNA=["G","C","T","A"]
with open("bigdatafile","w") as fd:
    for i in xrange(5000):
        fd.write("".join([ DNA[rnd.randint(4)] for x in xrange(2000)])+"\n")
print "DONE"
print
print "SEARCHING GGGGGGGGGGG"
last5, cnt = [0,0,0,0,0], 1   # ring buffer of end-of-line byte offsets
with open("bigdatafile","r") as fd:
    # readlines() pulls the whole file in first, so the seek()/read() below cannot upset the loop
    for l in fd.readlines():
        last5[cnt] = last5[(cnt+4)%5]+len(l)   # offset where the current line ends
        if "GGGGGGGGGGG" in l:
            print "FOUND!"
            fd.seek(last5[(cnt+1)%5])                    # jump back to the oldest remembered offset
            print fd.read(last5[cnt]-last5[(cnt+1)%5])   # reread the last 5 lines from disk
        cnt = (cnt+1)%5
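A variant of the same idea, for what it's worth: keep the line-start offsets in a deque and reread through a second file handle, so the scanning handle's position is never disturbed. A sketch, not benchmarked:

from collections import deque

starts = deque(maxlen=5)   # byte offsets where the last 5 lines begin
pos = 0
with open("bigdatafile", "r") as scan, open("bigdatafile", "r") as reread:
    for l in scan:
        starts.append(pos)   # the current line starts at this offset
        pos += len(l)        # the next line starts right after it
        if "GGGGGGGGGGG" in l:
            reread.seek(starts[0])               # oldest remembered line start
            print reread.read(pos - starts[0])   # the whole buffered window, reread from disk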
rth
  • I am currently working to incorporate and test @PatrickArtner 's script... I understand the basis of buffers and memory, and why this task can get tricky throwing in a `.read()` to a million+ line file, but would you elaborate on how the buffer comes into play in your script above? or hyperlink me a site or other stack question you think covers the information? Thanks for the help too! – Symbal Apr 24 '18 at 22:31
  • I updated my answer. Hope it is more clear now. Please see note at the end. Is that your situation? – rth Apr 25 '18 at 19:01
  • At first I didn't think I would be able to conceptualize the step-by-step you provided, but that made perfect sense to me... you lost me a little on something about modulo and buffers, but I get the step and iteration process you detailed and how it could be advantageous to use this technique when memory and possibly speed is vital... thank you for sharing! – Symbal Apr 25 '18 at 21:30
  • You are welcome! :) BUT note, if you need to save objects of constant size (integers, floats, complex) this technique will give you extraordinary improvement in performance because no new memory would be allocated as in original @PatrickArtner answer. However, it is not true for strings. They have different length and memory should be allocated independently for each string. Therefore if you need to process big strings, do not keep them in memory. Save position in the file where you find them and reread from this position if you need. For rare rereadings, this gives max performance. – rth Apr 25 '18 at 22:09
  • @Symbal I've updated my answer to the level of 'real world example' :) – rth Apr 25 '18 at 22:48

Instead of writing to a file as I go, I collect everything in a dictionary. Once the entire file is processed, the dictionary of summary data is dumped to a file as JSON. This uses @PatrickArtner's test file.

import os
import json

output_file = 'Output.txt'
filename = 'BigFile.txt'                 

#initiate output container
outDict = {}
for fields in ['numLines', 'numWords', 'numChrs', 'numMes']:
    outDict[fields] = 0

outDict['lineNum'] = []
outDict['lineList'] = []   # holds the content of matching lines

with open(filename, 'r') as file:
    for line in file:
        wordsList = line.split()   # split() splits on whitespace; split("\s") would split on the literal characters \s
        outDict['numLines'] += 1
        outDict['numWords'] += len(wordsList)
        outDict['numChrs'] += len(line)

        #find items in the line
        if "t" in line:
            outDict['numMes'] += 1
            #save line number (1-based)
            outDict['lineNum'].append(outDict['numLines'])
            #save line content
            outDict['lineList'].append(line)

#record output          
with open(output_file, 'w') as f1:
    f1.write(json.dumps(outDict))    

##print lines of desire
#x number of lines before
x=5    
with open(filename, 'r') as file:
    for i, line in enumerate(file, 1):   # start at 1 to match the 1-based numbers saved above
        #iterate over line numbers for which the condition was met
        for num in outDict['lineNum']:
            #if this line falls between a found line and that line minus x, print it
            if (num - x) <= i <= num:
                print(line)
Gene Burinsky
  • file is a stream - I do not know of any `file.index(line)` - what magic is that? – Patrick Artner Apr 24 '18 at 22:02
  • @PatrickArtner indeed, i'm working to work that out. Thanks for mentioning it – Gene Burinsky Apr 24 '18 at 22:02
  • file is still a stream, this `file[toPrint-X:toPrint]` does not work – Patrick Artner Apr 24 '18 at 22:08
  • his file output consists of each line number + line content that passes his if-condition. your dictionary only remembers the line numbers, not the lines' content. your output is considerably less than his. you might get away with this if you `for index,line in enumerate(file):` and use the `index` (the line index, 0-based, as integer) as key into your dict to store each `line` that fits the condition ... but still, you need the -X lines. If the file is small enough one could simply store all lines in a dict but with some GB data this won't work... – Patrick Artner Apr 24 '18 at 22:11
  • @PatrickArtner i'm not sure what you mean. `outDict['lineNum'].append(outDict['numLines'])` should append a list of line numbers for which the condition is met. – Gene Burinsky Apr 24 '18 at 22:17
  • Yes - line _numbers_ - not _number and line content_ as his does. – Patrick Artner Apr 24 '18 at 22:18

Since I mentioned it in the comments, here is how to do the same thing on a *nix machine using grep's context line control features.

First assume you have the following text file test.txt:

zero line
one line
two line
three line
four line 
five line 
six line
seven line 
eight line

If you want to get N lines before a match, you can use the -B option. For example, for 5 lines before "six":

$ grep -B 5 six test.txt 
one line
two line
three line
four line 
five line 
six line

There is also the -A option, which you can use to get N lines after a match, and -C, which you can use to get N lines before AND after.
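For example, to get 2 lines after the match, or 2 lines of context on each side:

$ grep -A 2 six test.txt 
six line
seven line 
eight line

$ grep -C 2 six test.txt 
four line 
five line 
six line
seven line 
eight line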

pault
  • Ahh thank you so much for elaborating with an example and linking to a full tutorial spot! I will definitely try this on the Unix terminal next time I need to fiddle with these files on the server. – Symbal Apr 25 '18 at 21:26
  • Found this site, and #6 walks you through the A/B/C functions! https://www.thegeekstuff.com/2009/03/15-practical-unix-grep-command-examples/ – Symbal Aug 29 '18 at 21:26