
I am trying to open specific lines of multiple files and return those lines for each file. My solution is quite time-consuming. Do you have any suggestions?
func.filename: the name of the given file
func.start_line: the starting point in the given file
func.end_line: the finishing point in the given file

from sys import stderr

def method_open(func):
    try:
        # use a context manager so the file is closed after reading
        with open(func.filename) as f:
            body = f.readlines()[func.start_line:func.end_line]
    except IOError:
        body = []
        stderr.write("\nCouldn't open the referenced method inside {0}".
                     format(func.filename))
        stderr.flush()
    return body

Bear in mind that sometimes func.filename is the same file as a previous call, but unfortunately this is not the case most of the time.

  • Why is it taking so much time? Is the file very large? If so then how large? – Muhammad Tahir Apr 05 '16 at 14:30
  • You can try the [linecache](https://docs.python.org/2/library/linecache.html) module or `itertools.islice` and see if they're also too time consuming. See more details here: http://stackoverflow.com/questions/2081836/reading-specific-lines-only-python – Paulo Almeida Apr 05 '16 at 14:47
  • no, I think the problem is that I am opening and closing the file over and over again, but linecache could be interesting if it had the capability of reading multiple lines at the same time. – Mehrdad Mehraban Apr 05 '16 at 15:37
  • I think I will have to time each solution to know which one is the fastest. – Mehrdad Mehraban Apr 05 '16 at 15:43
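As a follow-up to the linecache suggestion in the comments: it can in fact hand back multiple lines at once, because the stdlib function `linecache.getlines()` returns the whole cached file as a list that can be sliced. A minimal sketch (the helper name `read_lines_cached` is made up here for illustration):

```python
import linecache

def read_lines_cached(filename, start_line, end_line):
    # linecache caches the file contents after the first call, so
    # repeated lookups into the same file avoid reopening it each time.
    # getlines() returns every line; slicing yields the wanted range.
    return linecache.getlines(filename)[start_line:end_line]
```

Note that linecache keeps each file's contents in memory, so this trades memory for speed when the same file is queried repeatedly.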

1 Answer


The problem with readlines is that it reads the whole file into memory, and linecache does the same.

You can save some time by reading one line at a time and breaking out of the loop as soon as you reach func.end_line.

But the best method I found is to use itertools.islice.

Here are the results of some tests I ran on a 130 MB file of ~9701k lines:

--- 1.43700003624 seconds --- f_readlines
--- 1.00099992752 seconds --- f_enumerate
--- 1.1400001049 seconds --- f_linecache
--- 0.0 seconds --- f_itertools_islice

Here is the script I used:

import time
import linecache
import itertools


def f_readlines(filename, start_line, endline):
    # reads the entire file into memory, then slices
    with open(filename) as f:
        return f.readlines()[start_line:endline]


def f_enumerate(filename, start_line, endline):
    # reads one line at a time and stops once endline is reached
    result = []
    with open(filename) as f:
        for i, line in enumerate(f):
            if start_line <= i < endline:
                result.append(line)
            if i >= endline:
                break
    return result


def f_linecache(filename, start_line, endline):
    # linecache numbers lines from 1, so shift the 0-based range by one
    result = []
    for n in range(start_line + 1, endline + 1):
        result.append(linecache.getline(filename, n))
    return result


def f_itertools_islice(filename, start_line, endline):
    # islice reads lazily and stops consuming the file at endline
    with open(filename) as f:
        return list(itertools.islice(f, start_line, endline))


def runtest(func_to_test):
    filename = "testlongfile.txt"
    start_line = 5000
    endline = 10000
    start_time = time.time()
    func_to_test(filename, start_line, endline)
    print("--- %s seconds --- %s" % (time.time() - start_time,
                                     func_to_test.__name__))

runtest(f_readlines)
runtest(f_enumerate)
runtest(f_linecache)
runtest(f_itertools_islice)
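Applied back to the question's method_open, the islice version would look roughly like this (a sketch, keeping the original error handling and attribute names):

```python
import itertools
from sys import stderr

def method_open(func):
    try:
        # islice reads lazily: the file is consumed only up to
        # func.end_line, then reading stops and the file is closed.
        with open(func.filename) as f:
            return list(itertools.islice(f, func.start_line, func.end_line))
    except IOError:
        stderr.write("\nCouldn't open the referenced method inside {0}".
                     format(func.filename))
        stderr.flush()
        return []
```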
Francesco
  • interesting! This actually answers my question on which method is the fastest unlike the mentioned question: http://stackoverflow.com/questions/2081836/reading-specific-lines-only-python Thank you! – Mehrdad Mehraban Apr 06 '16 at 07:49