
I'm trying to read big files (~10 GB) of text data and put each string into a list.

import itertools
from multiprocessing import Pool

pool = Pool()

corpus = []
for file in files:
    with open(file) as source:
        # Use multiprocessing to read all lines and add them to the list
        filewords = pool.map(addline, source)

        # Concatenate each sublist in filewords into one list of all words
        filewords = list(itertools.chain(*filewords))

    corpus.append(filewords)

# do something with the list
function(corpus)

What should I do to make this more memory-efficient? Maybe with generators? (I have no experience with them.)

BadlyworkingAI
2 Answers


I would not necessarily use multiprocessing in that case. 10 GB is not that much, and you can easily do something simple like this:

for file in files:
    with open(file) as source:
        for line in source:
            # process
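If whatever consumes corpus can take an iterable instead of a real list, you could even stream the words lazily with a generator. A rough sketch (iter_words, and the use of str.split in place of your addline, are just placeholders for illustration):

def iter_words(files):
    # Lazily yield every word from every file, one line at a time,
    # so no file is ever held in memory as a whole.
    for file in files:
        with open(file) as source:
            for line in source:
                for word in line.split():
                    yield word

function(iter_words(files))  # only works if function accepts any iterable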

If you want to use your cluster, don't use multiprocessing; use your cluster's API instead.

DevShark

Like Antti Haapala suggested, see if mmap is a usable solution for you.
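If it looks promising, here is a minimal sketch (iter_mmap_lines is just an illustrative name) of iterating a memory-mapped file line by line without reading it into memory all at once:

import mmap

def iter_mmap_lines(path):
    # Yield raw bytes lines from a memory-mapped file; the OS pages the
    # file in and out, so it never has to live in a Python list.
    with open(path, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            for line in iter(mm.readline, b''):
                yield line
        finally:
            mm.close()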

If not, you might be able to use a generator, but it really depends on what you're doing with that ~10 GB text file. If you go down the generator road, I'd suggest that you make a class and override the __iter__ method. This way, if you have to iterate the file more than once, you always get a generator that starts at the beginning of the file.

This is important if you pass the generator around between functions.

  • A generator created by calling a generator function is its own iterator, so iterating it again just gives you the same (possibly exhausted) object.

  • Overriding __iter__ returns a new generator each time the object is iterated.

function generator:

def iterfile(my_file):
    with open(my_file) as the_file:
        for line in the_file:
            yield line

__iter__ generator:

class IterFile(object):

    def __init__(self, my_file):
        self.my_file = my_file

    def __iter__(self):
        with open(self.my_file) as the_file:
            for line in the_file:
                yield line

Difference in behavior:

>>> func_gen = iterfile('/tmp/junk.txt')
>>> iter(func_gen) is iter(func_gen)
True

>>> iter_gen = IterFile('/tmp/junk.txt')
>>> iter(iter_gen) is iter(iter_gen)
False

>>> list(func_gen)
['the only line in the file\n']
>>> list(func_gen)
[]

>>> list(iter_gen)
['the only line in the file\n']
>>> list(iter_gen)
['the only line in the file\n']
willnx
  • mmap requires a contiguous block of address space in the main process large enough for the whole file, so for large files it may not be feasible: http://stackoverflow.com/questions/1661986/why-doesnt-pythons-mmap-work-with-large-files – Padraic Cunningham Mar 25 '16 at 23:13