
This is the Python code I'm using. I have a 5 GB file which I need to split into around 10-12 files according to line numbers, but this code gives a memory error. Can someone tell me what is wrong with it?

from itertools import izip_longest

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

n = 386972

with open('reviewsNew.txt','rb') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=''), 1):
        with open('small_file_{0}'.format(i * n), 'w') as fout:
            fout.writelines(g)
Kanika Rawat
  • http://stackoverflow.com/questions/6475328/read-large-text-files-in-python-line-by-line-without-loading-it-in-to-memory Similar question – be_good_do_good Jul 24 '16 at 18:16
  • @be_good_do_good it's not the same. In my code, after 3-4 iterations I get a memory error. I don't know why; I am reading the file line by line :( – Kanika Rawat Jul 24 '16 at 18:23
  • Maybe not the issue, but you are opening the input file as binary while not saving the output the same way. – Mikael Rousson Jul 24 '16 at 18:42
  • @MikaelRousson when I open my file in text mode, only 139 lines are read and it stops, but when I open it in binary at least the whole file gets read :P Is it a problem to open it in binary and save it as a text file? – Kanika Rawat Jul 25 '16 at 04:55

1 Answer


Just use groupby, so you don't need to create 386972 iterators:

from itertools import groupby

n = 386972
with open('reviewsNew.txt','rb') as f:
    # group consecutive lines by index // n, so each group is one chunk of n lines
    for idx, lines in groupby(enumerate(f), lambda (idx, _): idx // n):
        with open('small_file_{0}'.format(idx * n), 'wb') as fout:
            fout.writelines(l for _, l in lines)
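
For completeness, here is a minimal sketch of the same groupby approach written for Python 3, where tuple unpacking in lambda arguments is no longer allowed; it assumes the same filename and chunk size as in the question:

from itertools import groupby

n = 386972

with open('reviewsNew.txt', 'rb') as f:
    # enumerate yields increasing line indices, so index // n changes
    # exactly once every n lines and groupby emits one group per chunk
    for idx, lines in groupby(enumerate(f), key=lambda pair: pair[0] // n):
        with open('small_file_{0}'.format(idx * n), 'wb') as fout:
            fout.writelines(line for _, line in lines)

Since the grouping key only ever increases, each group is consumed once and only one chunk of lines is held in memory at a time.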
Daniel