
This is the code I have to count word frequencies:

import collections
import codecs
import io
from collections import Counter
with io.open('Combine.txt', 'r', encoding='utf8') as infh:
    words = infh.read().split()
    with open('Counts2.txt', 'wb') as f:
        for word, count in Counter(words).most_common(100000000):
            f.write(u'{} {}\n'.format(word, count).encode('utf-8')) 

When I try to read a big file (4 GB), I get this error:

Traceback (most recent call last):
  File "counter.py", line 7, in <module>
    words =infh.read().split()
  File "/usr/lib/python2.7/codecs.py", line 296, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
MemoryError

I am using Ubuntu 12.04 with 8 GB RAM and an Intel Core i7. How can I fix this error?


2 Answers


This is the Pythonic way to process a file line by line:

with open(...) as fh:
    for line in fh:
        pass

This takes care of opening and closing the file, even if an exception is raised in the inner block. It also treats the file object fh as an iterable, which uses buffered I/O and manages memory for you, so you don't have to worry about large files.
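Applied to the word count from the question, that looks roughly like the sketch below (same file names as in the question; the file is never read in one piece, although the Counter itself still has to fit in memory):

import io
from collections import Counter

counts = Counter()
with io.open('Combine.txt', 'r', encoding='utf8') as infh:
    for line in infh:                 # buffered, one line at a time
        counts.update(line.split())   # count the words on this line

with open('Counts2.txt', 'wb') as f:
    for word, count in counts.most_common():
        f.write(u'{} {}\n'.format(word, count).encode('utf-8'))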

Michael Foukarakis
  • What if all the words are on a single line? – Jayanth Koushik Feb 11 '14 at 12:18
  • It should be trivial to either: a) convert it to one word per line via your shell, or b) read from the file in chunks (i.e. manually manage memory) and process accordingly; see the sketch after these comments. – Michael Foukarakis Feb 11 '14 at 12:19
  • @MichaelFoukarakis the error is at /usr/lib/python2.7/codecs.py, line 296, in decode: (result, consumed) = self._buffer_decode(data, self.errors, final) MemoryError –  Feb 11 '14 at 12:28
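For completeness, option (b) from the comment above might look like this sketch: read the decoded text in fixed-size chunks and carry any word that is split across a chunk boundary over to the next chunk (the 1 MB chunk size is an arbitrary choice):

import io
from collections import Counter

counts = Counter()
with io.open('Combine.txt', 'r', encoding='utf8') as infh:
    leftover = u''
    while True:
        chunk = infh.read(1024 * 1024)     # up to 1 MB of decoded text
        if not chunk:
            break
        chunk = leftover + chunk
        words = chunk.split()
        if not chunk[-1].isspace() and words:
            leftover = words.pop()         # last word may be cut off mid-chunk
        else:
            leftover = u''
        counts.update(words)
    if leftover:                           # count the final carried-over word
        counts[leftover] += 1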

How about readline() instead of read()?

http://docs.python.org/2/tutorial/inputoutput.html
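For example (a sketch using the file from the question; readline() returns one line at a time and an empty string at end of file):

import io
from collections import Counter

counts = Counter()
with io.open('Combine.txt', 'r', encoding='utf8') as infh:
    while True:
        line = infh.readline()    # u'' signals end of file
        if not line:
            break
        counts.update(line.split())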

user2814648