
I am reading a big file in chunks, like this:

    def gen_data(data):
        for i in range(0, len(data), chunk_sz):
            yield data[i: i + chunk_sz]

If I instead store the length in a variable rather than calling `len(data)`, something like this:

length_of_file = len(data)
def gen_data(data):
    for i in range(0, length_of_file, chunk_sz):
        yield data[i: i + chunk_sz]

What performance improvement will this give for big files? I tested with small ones but didn't see any change.

P.S. I come from a C/C++ background, where recomputing a loop bound on every iteration of a `while` or `for` loop is considered bad practice, because the expression is evaluated on every pass.
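For what it's worth, the two versions can be compared directly with `timeit`. This is just a sketch, assuming `data` is a `bytes` object already in memory and a hypothetical chunk size of 4096:

```python
import timeit

chunk_sz = 4096
data = b"x" * (10 * 1024 * 1024)  # 10 MiB of dummy data

def gen_data_len(data):
    # Calls len(data) when range() is built -- once per generator run.
    for i in range(0, len(data), chunk_sz):
        yield data[i: i + chunk_sz]

length_of_file = len(data)

def gen_data_var(data):
    # Looks up a precomputed global instead -- also once per run.
    for i in range(0, length_of_file, chunk_sz):
        yield data[i: i + chunk_sz]

# Consume each generator fully and compare timings.
t1 = timeit.timeit(lambda: sum(1 for _ in gen_data_len(data)), number=50)
t2 = timeit.timeit(lambda: sum(1 for _ in gen_data_var(data)), number=50)
print(f"len(data) version: {t1:.4f}s")
print(f"variable version:  {t2:.4f}s")
```

On my understanding the timings should be essentially identical, since the bound is computed only once either way.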

Kartik Thakurela
  • You are not "reading a file using the range function" - files are streams; you can't index into them like this. If you already have the file's data completely inside `data`, why chunk it? – Patrick Artner Jan 28 '19 at 12:16
  • On a different note, since you are using the variable `length_of_file` inside the function, it's better to define it in the function itself to avoid any possible conflict with a global variable of the same name. So put `length_of_file = len(data)` before the for loop in the function – Sheldore Jan 28 '19 at 12:17
  • [how-do-you-split-a-list-into-evenly-sized-chunks](https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks) explains about chunking – Patrick Artner Jan 28 '19 at 12:19

2 Answers


Use this code to read a big file in chunks:

def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


with open('really_big_file.dat') as f:
    for piece in read_in_chunks(f):
        process_data(piece)

Another option uses `iter()` with a sentinel:

f = open('really_big_file.dat')

def read1k():
    return f.read(1024)

for piece in iter(read1k, ''):
    process_data(piece)
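A variant of the `iter()` approach (not part of the answer above, just a sketch): for files opened in binary mode the sentinel must be `b''`, not `''`, and `functools.partial` saves defining a `read1k`-style helper. The demo file created here is a stand-in for `really_big_file.dat`:

```python
from functools import partial

# Create a small demo file (stand-in for really_big_file.dat).
with open('really_big_file.dat', 'wb') as out:
    out.write(b'x' * 2500)

pieces = []
# In binary mode, read() returns b'' at EOF, so that is the sentinel.
with open('really_big_file.dat', 'rb') as f:
    for piece in iter(partial(f.read, 1024), b''):
        pieces.append(piece)

print([len(p) for p in pieces])  # chunk sizes: [1024, 1024, 452]
```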
Anonymous

Python's `for` loops are not C `for` loops, but really foreach-style loops. In your example:

    for i in range(0, len(data), chunk_sz):

`range()` is only called once, and Python then iterates over its return value (a list in Python 2, an iterable `range` object in Python 3). In other words, from this point of view your snippets are equivalent; the difference is that the second snippet uses a non-local variable `length_of_file`, so you actually get a (small) performance hit from resolving it.
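To illustrate that the bound is evaluated once, here is a small sketch using a hypothetical `bound()` wrapper around `len()` that counts its own calls:

```python
calls = 0

def bound(seq):
    # Wrapper around len() that counts how often it runs.
    global calls
    calls += 1
    return len(seq)

data = b"abcdefghij"
chunk_sz = 3

chunks = [data[i: i + chunk_sz] for i in range(0, bound(data), chunk_sz)]
print(calls)   # 1 -- the stop argument is computed once, not per iteration
print(chunks)  # [b'abc', b'def', b'ghi', b'j']
```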

> I am from C/C++ background where calculating in each repetition in while or for loop is a bad practice because it executes for every call

Possible compiler optimizations aside, this is true for most, if not all, languages.

That being said, and as others have already mentioned in the comments and answers: this is not how you read a file in chunks; you want SurajM's first snippet.

bruno desthuilliers