32

Is there a limit to memory for Python? I've been using a Python script to calculate the average values from a file that is a minimum of 150 MB in size.

Depending on the size of the file I sometimes encounter a MemoryError.

Can more memory be assigned to Python so I don't encounter the error?


EDIT: Code now below

NOTE: The file sizes can vary greatly (up to 20 GB); the minimum size of a file is 150 MB

file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")
file_write = open ("average_generations.txt", "w")
mutation_average = open("mutation_average", "w")

files = [file_A1_B1,file_A2_B2,file_A1_B2,file_A2_B1]

for u in files:
    line = u.readlines()
    list_of_lines = []
    for i in line:
        values = i.split('\t')
        list_of_lines.append(values)

    count = 0
    for j in list_of_lines:
        count +=1

    for k in range(0,count):
        list_of_lines[k].remove('\n')

    length = len(list_of_lines[0])
    print_counter = 4

    for o in range(0,length):
        total = 0
        for p in range(0,count):
            number = float(list_of_lines[p][o])
            total = total + number
        average = total/count
        print average
        if print_counter == 4:
            file_write.write(str(average)+'\n')
            print_counter = 0
        print_counter +=1
file_write.write('\n')
martineau
Harpal
  • Can you show us your script? I've processed much bigger files in Python without issues. – robert Nov 26 '10 at 12:14
  • What is your script trying to do? It looks to me like you want to calculate the average value of every fourth column in each of the input files. Is that right? – Tim Pietzcker Nov 26 '10 at 12:51
  • I have noticed significant performance differences in regard to Memory when running the same Python application on Windows (XP) and OS X/Linux. The performance on the Windows side tends to be the worst. – SW_user2953243 Jan 18 '15 at 01:41

5 Answers

34

(This is my third answer, because I misunderstood what your code was doing in my original and then made a small but crucial mistake in my second; hopefully three's a charm.)

Edits: Since this seems to be a popular answer, I've made a few modifications over the years to improve its implementation, most not too major, so that folks who use it as a template will have an even better basis.

As others have pointed out, your MemoryError problem is most likely because you're attempting to read the entire contents of huge files into memory and then, on top of that, effectively doubling the amount of memory needed by creating a list of lists of the string values from each line.

Python's memory limits are determined by how much physical RAM and virtual-memory disk space your computer and operating system have available. Even if you don't use it all up and your program "works", using that much memory may be impractical because it simply takes too long.
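As an illustration (not part of the original code; the resource module is Unix-only and unavailable on Windows), you can query the OS-imposed address-space limit for a process and check whether the interpreter is a 32- or 64-bit build, both of which bound how much memory Python can use:

```python
import sys
import resource  # Unix-only standard-library module

# RLIMIT_AS is the OS cap on this process's total address space;
# RLIM_INFINITY means no explicit cap beyond RAM plus swap.
soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_AS)
print('address-space limit:',
      'unlimited' if soft_limit == resource.RLIM_INFINITY
      else '%d bytes' % soft_limit)

# A 32-bit interpreter is capped at 2-4 GB of address space no matter
# how much RAM the machine has.
print('64-bit interpreter:', sys.maxsize > 2 ** 32)
```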

Anyway, the most obvious way to avoid that is to process each file a single line at a time, which means you have to do the processing incrementally.

To accomplish this, the script keeps a list of running totals, one for each of the fields. When a file has been read completely, the average value of each field can be calculated by dividing the corresponding total by the count of lines read. Once that is done, the averages can be printed out, and some of them written to one of the output files. I've also made a conscious effort to use very descriptive variable names to try to make it understandable.

try:
    from itertools import izip_longest
except ImportError:    # Python 3
    from itertools import zip_longest as izip_longest

GROUP_SIZE = 4
input_file_names = ["A1_B1_100000.txt", "A2_B2_100000.txt", "A1_B2_100000.txt",
                    "A2_B1_100000.txt"]
file_write = open("average_generations.txt", 'w')
mutation_average = open("mutation_average", 'w')  # left in, but nothing written

for file_name in input_file_names:
    with open(file_name, 'r') as input_file:
        print('processing file: {}'.format(file_name))

        totals = []
        for count, fields in enumerate((line.split('\t') for line in input_file), 1):
            totals = [sum(values) for values in
                        izip_longest(totals, map(float, fields), fillvalue=0)]
        averages = [total/count for total in totals]

        for print_counter, average in enumerate(averages):
            print('  {:9.4f}'.format(average))
            if print_counter % GROUP_SIZE == 0:
                file_write.write(str(average)+'\n')

file_write.write('\n')
file_write.close()
mutation_average.close()
martineau
  • -1 (a) The OP is **NOT** "attempting to read several large files into memory all at once"; he's reading them one at a time. (b) The OP is however doubling up the memory taken by each file as he reads it [see my answer]. (c) Your code just won't work; `totals` and `field` are **str** objects; we want **numerical** totals to compute averages; your totals are going to grow into some very long strings; this is Python, not awk; you need to throw a few `float()`s in there (d) `totals = [field for field in fields]` instead of `totals = fields` ??? – John Machin Nov 27 '10 at 08:39
  • @John Machin: Good catches -- esp on the need to convert the string values into numeric values. That `totals = [field for...` was just an artifact from a point in my coding where I thought I needed a separate copy of the fields list. – martineau Nov 27 '10 at 16:16
  • @Harpal: Thanks, I hope it works for you now, too. I must say that @John Machin's criticisms were very beneficial in helping me arrive at my final answer and he deserves recognition for providing them. – martineau Nov 30 '10 at 11:55
  • Wait, so was the answer 'No there is no imposed memory limit'? – ThorSummoner Apr 23 '15 at 17:44
  • @ThorSummoner: From third paragraph: "_Python's memory limits are determined by how much physical ram and virtual memory disk space your computer and operating system have available._" – martineau Apr 23 '15 at 18:12
  • @martineau [defensively] Of course, that same limit is imposed on all processes. So it's not a limit of Python, it's a limit of the system Python is running on! Does not explicitly answer the explicit question IMO. – ThorSummoner Apr 23 '15 at 18:19
  • @ThorSummoner: Dunno, seems to me like it directly addresses the question. Although I didn't mention that it's also inherently limited by whether it's the 32- or 64-bit version of the interpreter. – martineau Apr 23 '15 at 18:37
20

You're reading the entire file into memory (line = u.readlines()), which will of course fail if the file is too large (and you say that some are up to 20 GB), so that's your problem right there.

The recommended approach is to iterate over the file object itself, which yields one line at a time:

for current_line in u:
    do_something_with(current_line)

That way only a single line is held in memory at once.

Later in your script, you're doing some very strange things, such as first counting all the items in a list and then constructing a for loop over the range of that count. Why not iterate over the list directly? What is the purpose of your script? I have the impression that this could be done much more easily.

This is one of the advantages of high-level languages like Python (as opposed to C, where you do have to do these housekeeping tasks yourself): allow Python to handle iteration for you, and collect in memory only what you actually need at any given time.

Also, since it seems that you're processing TSV files (tab-separated values), you should take a look at the csv module, which will handle all the splitting, removal of \ns, etc. for you.
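For example, a minimal sketch of the csv-based approach (Python 3; the function name and column layout are illustrative, not the OP's exact code):

```python
import csv

# Stream a tab-separated file and keep one running total per column,
# so memory use is proportional to the number of columns rather than
# the file size.
def column_averages(path):
    totals = []
    row_count = 0
    with open(path, newline='') as tsv_file:
        # csv.reader splits on tabs and strips line terminators for us
        for row in csv.reader(tsv_file, delimiter='\t'):
            if not totals:
                totals = [0.0] * len(row)
            for index, cell in enumerate(row):
                totals[index] += float(cell)
            row_count += 1
    return [total / row_count for total in totals]
```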

Tim Pietzcker
18

Python can use all memory available to its environment. My simple "memory test" crashes on ActiveState Python 2.6 after using about

1959167 [MiB]

On jython 2.5 it crashes earlier:

 239000 [MiB]

I could probably configure Jython to use more memory (it takes its limits from the JVM).
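For instance (assuming the standard Jython launcher, whose -J prefix passes options straight through to the underlying JVM; the script name is illustrative):

```shell
# Give the JVM a 4 GiB heap before running the script;
# -J-Xmx forwards the -Xmx heap-size option to the JVM.
jython -J-Xmx4g memtest.py
```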

Test app:

import sys

sl = []
i = 0
# some magic 1024 - overhead of string object
fill_size = 1024
if sys.version.startswith('2.7'):
    fill_size = 1003
if sys.version.startswith('3'):
    fill_size = 497
print(fill_size)
MiB = 0
while True:
    s = str(i).zfill(fill_size)
    sl.append(s)
    if i == 0:
        try:
            sys.stderr.write('size of one string %d\n' % (sys.getsizeof(s)))
        except AttributeError:
            pass
    i += 1
    if i % 1024 == 0:
        MiB += 1
        if MiB % 25 == 0:
            sys.stderr.write('%d [MiB]\n' % (MiB))

In your app you read the whole file at once. For files this big you should read it line by line.

Bertrand Caron
Michał Niklas
  • I ran your script on my machine (Win7-64, Python 2.7, 16 GB memory); it crashed after using 1900 [MiB], but Task Manager showed about 8000 MB of physical memory available. So "Python can use all memory available to its environment" may not be true. – lengxuehx Sep 11 '15 at 09:54
  • I was wrong. The reason it crashes is that a 32-bit process gets a 2 GB limit by default on Windows. – lengxuehx Sep 11 '15 at 10:11
  • Anyone know why the default Windows Python installer is still 32-bit in 2018? – Elliot Sep 05 '18 at 06:54
10

No, there's no Python-specific limit on the memory usage of a Python application. I regularly work with Python applications that may use several gigabytes of memory. Most likely, your script actually uses more memory than available on the machine you're running on.

In that case, the solution is to rewrite the script to be more memory efficient, or to add more physical memory if the script is already optimized to minimize memory usage.

Edit:

Your script reads the entire contents of your files into memory at once (line = u.readlines()). Since you're processing files up to 20 GB in size, you're going to get memory errors with that approach unless you have huge amounts of memory in your machine.

A better approach would be to read the files one line at a time:

for u in files:
    for line in u:  # this will iterate over each line in the file
        # read values from the line, do the necessary calculations
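A minimal sketch of what that loop body might look like (the helper name is illustrative; assumes tab-separated numeric columns, as in the OP's files):

```python
# Accumulate one running total per column so that only a single line
# is ever held in memory. Works with any iterable of lines, e.g. an
# open file object.
def running_column_averages(lines):
    totals = []
    line_count = 0
    for line in lines:
        fields = line.split('\t')
        if not totals:
            totals = [0.0] * len(fields)
        for index, field in enumerate(fields):
            totals[index] += float(field)  # float() ignores the trailing '\n'
        line_count += 1
    return [total / line_count for total in totals]
```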
Pär Wieslander
6

Not only are you reading the whole of each file into memory, you are also laboriously replicating that information in a table called list_of_lines.

You have a secondary problem: your choices of variable names severely obfuscate what you are doing.

Here is your script rewritten with the readlines() caper removed and with meaningful names:

file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")
file_write = open ("average_generations.txt", "w")
mutation_average = open("mutation_average", "w") # not used
files = [file_A1_B1,file_A2_B2,file_A1_B2,file_A2_B1]
for afile in files:
    table = []
    for aline in afile:
        values = aline.split('\t')
        values.remove('\n') # why?
        table.append(values)
    row_count = len(table)
    row0length = len(table[0])
    print_counter = 4
    for column_index in range(row0length):
        column_total = 0
        for row_index in range(row_count):
            number = float(table[row_index][column_index])
            column_total = column_total + number
        column_average = column_total/row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average)+'\n')
            print_counter = 0
        print_counter +=1
file_write.write('\n')

It rapidly becomes apparent that (1) you are calculating column averages, and (2) the obfuscation led some others to think you were calculating row averages.

As you are calculating column averages, no output is required until the end of each file, and the amount of extra memory actually required is proportional to the number of columns.

Here is a revised version of the outer loop code:

for afile in files:
    for row_count, aline in enumerate(afile, start=1):
        values = aline.split('\t')
        values.remove('\n') # why?
        fvalues = map(float, values)
        if row_count == 1:
            row0length = len(fvalues)
            column_index_range = range(row0length)
            column_totals = fvalues
        else:
            assert len(fvalues) == row0length
            for column_index in column_index_range:
                column_totals[column_index] += fvalues[column_index]
    print_counter = 4
    for column_index in column_index_range:
        column_average = column_totals[column_index] / row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average)+'\n')
            print_counter = 0
        print_counter +=1
John Machin
  • Not a big deal, but there's really no reason to `float` the values read into a separate list nor make their (columnar) totals real numbers -- only need to make sure their average values are calculated in that format. – martineau Nov 27 '10 at 05:31
  • @martineau: If by your first point you mean `values = map(float, values)`: I detest such typeshifting. Second point: how can columnar totals not be floats???? – John Machin Nov 27 '10 at 07:04
  • @John Machin: I meant that the values could be integers rather than floats. At the time I was thinking they already were, but understand now that since they're initially strings they need to be converted to some kind of numeric type. Given that I was thinking they were integers, it follows that their totals could also have been -- hence the 2nd point. Your conversion to `float` is probably correct, which would indeed require that the total be so, too. – martineau Nov 27 '10 at 15:31
  • Logic problem: I don't think the code shown in the revised version of the outer loop in your answer will work because `row_count` starts at `1`, so the `if not row_count:` initialization will never be executed. – martineau Nov 27 '10 at 16:49
  • @martineau: re the float business, the OP is using float(). Thanks for spotting the row_count bug; fixed. – John Machin Nov 27 '10 at 19:49
  • @John Machin: Guess I've learned the hard way working on this one to never post an answer here with Python code in it that I haven't tested (something I had never done before). That would have eliminated most of the issues with the earlier attempts -- and I could probably also have used the same test files to determine what the OP's code really did. – martineau Nov 27 '10 at 23:10