How to extract a certain number of lines from somewhere within a file

Question

I have an input file that looks something like this:

#nP 4
#mP 0.0262
#mH     10
#HP various info:
14  H   0.026
19  P   0.054
20  H   0.012
512 H   0.005
#xP
#kP
99
89
90

I want to extract 4 lines (because nP = 4 in the first line) starting from line 5, so the output would look like this:

14  H   0.026
19  P   0.054
20  H   0.012
512 H   0.005

I have tried this:

import sys

head = sys.stdin.readline()
head = head.strip()
head = head.split('\t')
cntHetPos = int(head[1])
if "#HP" in sys.stdin.readlines():
  lines = sys.stdin.readlines()[0:cntHetPos]
  print lines

but it doesn't print the lines, and it doesn't give an error message either. I based this on a previous answer I found here: "Read file from line 2 or skip header row". Any ideas?

This might be able to help you? http://stackoverflow.com/questions/2081836/reading-specific-lines-only-python — Henrik Andersson, Apr 08 '13 at 10:52

Thomas · Accepted Answer · 2013-04-08T11:33:09.967

readlines() returns a list of all lines the first time you call it, but the second time, it's empty because all lines have already been read and consumed. Store them in a variable:

lines = sys.stdin.readlines()

Put that at the top because you might as well use it to read your head variable from:

head = lines[0]
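
To see the consumption behaviour in isolation, here is a minimal Python 3 sketch. It assumes io.StringIO as a stand-in for sys.stdin so it can run without piped input; the sample text is a shortened version of the file in the question.

```python
import io

# io.StringIO stands in for sys.stdin here (an assumption for illustration).
stream = io.StringIO("#nP 4\n#HP various info:\n14  H   0.026\n")

first = stream.readlines()   # reads every remaining line
second = stream.readlines()  # the stream is exhausted, so nothing is left

print(len(first))  # 3
print(second)      # []
```

This is why the `if "#HP" in sys.stdin.readlines()` test in the question leaves nothing for the second call to return.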

The other problem is that you need to loop over all lines to find the #HP token, and you need to keep track of the line number so you can slice the list correctly:

for i, line in enumerate(lines):
  if "#HP" in line:
    lines = lines[i+1 : i+1+cntHetPos]

Finally, if you want to print the lines rather than the formatted list, you need to join them (note that the end-of-line character is already in there):

    print ''.join(lines),

And, for good measure, we can stop as soon as we've found the right line, so we break right after the print.

To sum up:

import sys

lines = sys.stdin.readlines()
head = lines[0]
head = head.strip()
head = head.split('\t')
cntHetPos = int(head[1])
for i, line in enumerate(lines):
  if "#HP" in line:
    lines = lines[i+1 : i+1+cntHetPos]
    print ''.join(lines),
    break
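
For Python 3, the same idea could be written as a small function. This is only a sketch: the name extract_block and its marker parameter are made up for illustration, and it parses the count with split() (any whitespace) rather than split('\t'), on the assumption that the header may be separated by spaces instead of a tab.

```python
def extract_block(lines, marker="#HP"):
    """Return the lines that follow the first line starting with `marker`.

    The number of lines to return is read from the first line (e.g. '#nP 4').
    """
    if not lines:
        return []
    count = int(lines[0].split()[1])  # split() accepts spaces or tabs
    for i, line in enumerate(lines):
        if line.startswith(marker):
            return lines[i + 1 : i + 1 + count]
    return []  # marker not found


# Sample input copied from the question; keepends=True keeps the newlines.
sample = (
    "#nP 4\n#mP 0.0262\n#mH     10\n#HP various info:\n"
    "14  H   0.026\n19  P   0.054\n20  H   0.012\n512 H   0.005\n"
    "#xP\n#kP\n99\n89\n90\n"
).splitlines(keepends=True)

print("".join(extract_block(sample)), end="")
```

To read from standard input instead, pass sys.stdin.readlines() in place of sample.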

I tried but it didn't work... Do I not need to collect the lines I want to print outside the loop? – edg, Apr 08 '13 at 11:28

score 0 · Answer 2 · 2013-04-08T11:11:08.933

This is a pretty ugly matching pattern, but it might fit your needs:

/#nP.*?#HP.*?$.*?(\d+ +\w +[\d\.]+).*?(\d+ +\w +[\d\.]+).*?(\d+ +\w +[\d\.]+).*?(\d+ +\w +[\d\.]+)/gsm

It captures the 4 lines you want as separate groups. You could even add subgroups so you instantly get the fields of 14 H 0.026 separately. Something like:

(\d+) +(\w) +([\d\.]+)

Example

import re

string = '''#nP 4
#mP 0.0262
#mH     10
#HP various info:
14  H   0.026
19  P   0.054
20  H   0.012
512 H   0.005'''

# Raw string so that \d and \w reach the regex engine unchanged
result = re.findall(r'#nP.*?#HP.*?$.*?(\d+ +\w +[\d\.]+).*?(\d+ +\w +[\d\.]+).*?(\d+ +\w +[\d\.]+).*?(\d+ +\w +[\d\.]+)', string, re.S | re.M)

print(result)

Output

[('14  H   0.026', '19  P   0.054', '20  H   0.012', '512 H   0.005')]
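
The pattern hard-codes four capture groups, while in the question the count comes from the #nP line. A possible variation, sketched here for Python 3 under the assumption that the data rows always look like "number, letter, decimal", reads the count first and builds the repetition from it:

```python
import re

text = '''#nP 4
#mP 0.0262
#mH     10
#HP various info:
14  H   0.026
19  P   0.054
20  H   0.012
512 H   0.005
#xP
#kP
99
89
90'''

# Read the count from the first line instead of hard-coding it.
count = int(re.match(r'#nP\s+(\d+)', text).group(1))

# Skip the rest of the #HP header line, then capture exactly `count` data rows.
pattern = re.compile(r'^#HP[^\n]*\n((?:\d+ +\w +[\d.]+\n?){%d})' % count, re.M)
rows = pattern.search(text).group(1).splitlines()

print(rows)
# ['14  H   0.026', '19  P   0.054', '20  H   0.012', '512 H   0.005']
```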

edited Apr 08 '13 at 11:11

answered Apr 08 '13 at 11:02

_Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems._ (jwz) – Thomas Apr 08 '13 at 11:05
What's the problem with some of you people on StackOverflow. Geez.. Pessimism much? – Apr 08 '13 at 11:07
Some tongue-in-cheek intended. Sorry :) But I still don't think a regex is the right way of solving this. (Also, it doesn't solve the problem as presented, because the value 4 is not static; it's read from the file.) – Thomas Apr 08 '13 at 11:11
It makes me think a lot of the *why is jQuery needed for every JavaScript problem* spot-links people put in comments. I'll leave it to the OP to decide if it's useful. I'm not posting answers on SO to hog reputation. Just hope it can help the OP :) If we lived in a land of rainbows and unicorns, we probably would've never needed computers ;) – Apr 08 '13 at 11:15
Yep, always good to have multiple different solutions to choose from! – Thomas Apr 08 '13 at 11:16

score 0 · Answer 3 · answered Apr 08 '13 at 11:24

Perhaps something like:

from itertools import islice

with open('yourfile') as fin:
    count = int(next(fin).split()[1])
    non_comments = (line for line in fin if not line.startswith('#'))
    print list(islice(non_comments, None, count))
    # ['14  H   0.026\n', '19  P   0.054\n', '20  H   0.012\n', '512 H   0.005\n']

Jon Clements

This does not work, since there are other lines in the file that do not start with #. I only want to extract nP lines (a number always given after nP in the first line of the files) following the line that starts with '#HP'. – edg Apr 08 '13 at 11:32
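
One way the iterator approach might be adapted to that requirement is to skip ahead to the #HP line with itertools.dropwhile and only then take the counted lines. This is a Python 3 sketch; the function name lines_after_marker is hypothetical, and io.StringIO stands in for the opened file.

```python
import io
from itertools import dropwhile, islice


def lines_after_marker(fin, marker="#HP"):
    """Return the counted lines that follow the first `marker` line in `fin`."""
    count = int(next(fin).split()[1])  # first line, e.g. '#nP 4'
    # dropwhile discards lines until one starts with the marker;
    # that marker line itself is still the first item yielded.
    rest = dropwhile(lambda line: not line.startswith(marker), fin)
    next(rest, None)  # discard the marker line (no error if it is missing)
    return list(islice(rest, count))


# Sample input copied from the question.
fin = io.StringIO(
    "#nP 4\n#mP 0.0262\n#mH     10\n#HP various info:\n"
    "14  H   0.026\n19  P   0.054\n20  H   0.012\n512 H   0.005\n"
    "#xP\n#kP\n99\n89\n90\n"
)
result = lines_after_marker(fin)
print(result)
# ['14  H   0.026\n', '19  P   0.054\n', '20  H   0.012\n', '512 H   0.005\n']
```

Because the lines are taken by position after the marker rather than filtered by their first character, later lines that do not start with # (such as 99, 89, 90) are not picked up.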

Adam Matan · Answer 4 · 2013-04-08T11:39:14.407

The linecache module is tailored for efficient line reading from files:

The linecache module allows one to get any line from any file, while attempting to optimize internally, using a cache, the common case where many lines are read from a single file. This is used by the traceback module to retrieve source lines for inclusion in the formatted traceback.

Assuming the text file is called blah, and that the file data starts at the fifth line:

#!/usr/bin/python   

import linecache

starting_line_number = 5   
number_of_lines      = int(linecache.getline('blah',1).split()[1])
for line_num in range(starting_line_number, starting_line_number+number_of_lines):
    print linecache.getline('blah',line_num),
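
The starting line of 5 is hard-coded to match the sample file. If the position of the #HP line can vary, it could be looked up first. The following Python 3 sketch uses only linecache.getline; the helper find_line_number is hypothetical, and a temporary file stands in for blah so the example is self-contained.

```python
import linecache
import os
import tempfile


def find_line_number(path, marker="#HP"):
    """Return the 1-based number of the first line starting with `marker`, or None."""
    line_num = 1
    while True:
        line = linecache.getline(path, line_num)
        if not line:  # getline returns '' once past the end of the file
            return None
        if line.startswith(marker):
            return line_num
        line_num += 1


# Sample input copied from the question, written to a temporary file.
sample = (
    "#nP 4\n#mP 0.0262\n#mH     10\n#HP various info:\n"
    "14  H   0.026\n19  P   0.054\n20  H   0.012\n512 H   0.005\n"
    "#xP\n#kP\n99\n89\n90\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

count = int(linecache.getline(path, 1).split()[1])
start = find_line_number(path) + 1  # data begins on the line after #HP
block = [linecache.getline(path, n) for n in range(start, start + count)]
print("".join(block), end="")

linecache.clearcache()
os.remove(path)
```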
