
I'm trying to read big files (~10 GB) of text data and put each string into a list.

import itertools
from multiprocessing import Pool

pool = Pool()

corpus = []
for file in files:
    with open(file) as source:
        # Use multiprocessing to read all lines and add them to the list
        filewords = pool.map(addline, source)

        # Concatenate each sublist in filewords into one list of all words
        filewords = list(itertools.chain(*filewords))

    corpus.append(filewords)

# do something with the list
function(corpus)

What should I do to make this more memory-efficient? Maybe with generators? (I have no experience with them.)

BadlyworkingAI
2 Answers


I would not necessarily use multiprocessing in that case. 10 GB is not that much, and you can easily do something simple like this:

for file in files:
    with open(file) as source:
        for line in source:
            # process
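If whatever consumes corpus can take an iterable instead of a real list, you could even stream the words lazily with a generator. A rough sketch (iter_words, and the use of str.split in place of your addline, are just placeholders for illustration):

def iter_words(files):
    # Lazily yield every word from every file, one line at a time,
    # so no file is ever held in memory as a whole.
    for file in files:
        with open(file) as source:
            for line in source:
                for word in line.split():
                    yield word

function(iter_words(files))  # only works if function accepts any iterable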

If you want to use your cluster, don't use multiprocessing; use your cluster's API instead.

DevShark

Like Antti Haapala suggested, see if mmap is a usable solution for you.
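If it looks promising, here is a minimal sketch (iter_mmap_lines is just an illustrative name) of iterating a memory-mapped file line by line without reading it into memory all at once:

import mmap

def iter_mmap_lines(path):
    # Yield raw bytes lines from a memory-mapped file; the OS pages the
    # file in and out, so it never has to live in a Python list.
    with open(path, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            for line in iter(mm.readline, b''):
                yield line
        finally:
            mm.close()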

If not, you might be able to use a generator, but it really depends on what you're doing with that ~10 GB text file. If you go down the generator road, I'd suggest that you make a class and override the __iter__ method. This way, if you have to iterate the file more than once, you always get a generator that starts at the beginning of the file.

This is important if you pass the generator around between functions.

  • A generator created by calling a generator function is its own iterator, so iterating it again just gives you the same (possibly exhausted) object.

  • Overriding __iter__ returns a new generator each time the object is iterated.

function generator:

def iterfile(my_file):
    with open(my_file) as the_file:
        for line in the_file:
            yield line

__iter__ generator:

class IterFile(object):

    def __init__(self, my_file):
        self.my_file = my_file

    def __iter__(self):
        with open(self.my_file) as the_file:
            for line in the_file:
                yield line

Difference in behavior:

>>> func_gen = iterfile('/tmp/junk.txt')
>>> iter(func_gen) is iter(func_gen)
True

>>> iter_gen = IterFile('/tmp/junk.txt')
>>> iter(iter_gen) is iter(iter_gen)
False

>>> list(func_gen)
['the only line in the file\n']
>>> list(func_gen)
[]

>>> list(iter_gen)
['the only line in the file\n']
>>> list(iter_gen)
['the only line in the file\n']
willnx
  • mmap requires a contiguous block of address space in the main process large enough for the whole file, so for large files it may not be feasible: http://stackoverflow.com/questions/1661986/why-doesnt-pythons-mmap-work-with-large-files – Padraic Cunningham Mar 25 '16 at 23:13