
I have a system with 32 GB of memory, most of it available for the work I'm trying to do:

$ more /proc/meminfo
MemFree:        29535136 kB
MemAvailable:   30789956 kB
...

I have some code that encodes the letters of a DNA string as one-hot vectors:

#!/usr/bin/env python

import os
import sys
import numpy as np
from Bio import SeqIO
import errno
import gzip
import shutil

seq_encoding = {'A' : [1, 0, 0, 0],
                'C' : [0, 1, 0, 0],
                'G' : [0, 0, 1, 0],
                'T' : [0, 0, 0, 1],
                'N' : [0, 0, 0, 0]}

sequence_chunk_length = 200

def sequence_split_by_length(seq, n):
    """                                                                                                                     
    A generator to divide a sequence into chunks of n characters and return                                                 
    the base array.                                                                                                         
    """
    while seq:
        yield [seq_encoding[base] for base in seq[:n].upper()]
        seq = seq[n:]

def encode_chromosome(name, length):
    enc_records = []
    fasta_fn = os.path.join(fasta_directory, name + '.fa')
    fasta_fh = open(fasta_fn, "rU")
    for record in SeqIO.parse(fasta_fh, "fasta"):
        for chunk in sequence_split_by_length(str(record.seq), sequence_chunk_length):
            enc_records.extend(np.asarray(chunk))
    fasta_fh.close()
    enc_arr = np.asarray(enc_records)

# ... some more code not relevant to exception ...

Encoding fails at the line:

enc_arr = np.asarray(enc_records)

Here is the relevant part of the thrown exception:

Traceback (most recent call last):
  File "./encode_sequences.py", line 95, in <module>
    res = encode_chromosome(chromosome_name, sequence_chunk_length)
  File "./encode_sequences.py", line 78, in encode_chromosome
    enc_arr = np.asarray(enc_records)
  ...
MemoryError

The data structure that would be encoded is about 1 GB in size, which would seem to fit within the free memory available on this system.

Is there an alternative method or procedure for converting a Python list to a Numpy array that helps get around MemoryError exceptions with Numpy methods like asarray()?

Alex Reynolds
  • You should check how much memory is actually used by the Python process (see the first sketch after these comments): [Python script knows how much memory it's using](http://stackoverflow.com/questions/1240581/python-script-knows-how-much-memory-its-using) – Moinuddin Quadri Nov 20 '16 at 09:44
  • `np.asarray` is not your only object here; before execution reaches that line you have a lot of other objects alive, like the `SeqIO.parse` records, each `chunk`, and the `enc_records` list itself. – Mazdak Nov 20 '16 at 09:47
  • Also, when you use `np.asarray` it will use a default data type for your array, which is usually wider than what you need. You can specify a smaller type manually in order to consume less memory, for example `int8` instead of `int64` (see the second sketch after these comments). – Mazdak Nov 20 '16 at 09:50
  • Tell us about `enc_records`. Length, length of a sublist. The result of applying `np.array` to a subset of the list. Do you expect a sparse array, mostly 0s? – hpaulj Nov 20 '16 at 10:32
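Following up on the first comment, here is a minimal sketch of checking the process's peak memory with the standard-library resource module. This is an illustration, not part of the original code, and it is Unix-only:

import resource

# Peak resident set size of the current process so far.
# On Linux the value is reported in kilobytes; on macOS it is in bytes.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("peak RSS: %d kB" % peak)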
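And to see how much the dtype matters, a small sketch comparing the default integer dtype with int8 for one encoded 200x4 chunk (the default is platform-dependent, typically int64 on 64-bit Linux):

import numpy as np

chunk = [[1, 0, 0, 0]] * 200                 # one encoded 200-base chunk
default_arr = np.asarray(chunk)              # default integer dtype, int64 here
small_arr = np.asarray(chunk, dtype=np.int8) # one byte per element

print(default_arr.nbytes)  # 6400 bytes: 200 * 4 values * 8 bytes each
print(small_arr.nbytes)    # 800 bytes:  200 * 4 values * 1 byte each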

1 Answer


This should be a quick fix:

enc_records = []
for record in SeqIO.parse(fasta_fh, "fasta"):
    for chunk in sequence_split_by_length(str(record.seq), sequence_chunk_length):
        # Append each 200x4 chunk as a single int8 array instead of
        # extending the list with thousands of tiny row vectors.
        enc_records.append(np.asarray(chunk, dtype=np.int8))
# Stack the chunks into one (N, 4) array.
enc_arr = np.vstack(enc_records)

I changed two things:

  • Use dtype int8 instead of the default integer dtype. The default numpy integer is 32 or 64 bits depending on the platform, while int8 is only 8 bits.
  • Collect 2D arrays (200x4) in a list instead of (1x4) vectors, since numpy arrays carry a fixed per-object memory overhead that dominates for very small arrays (see the sketch below).
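To put rough numbers on both points (exact byte counts vary by numpy version and platform, and the chromosome length here is only an assumption): a chromosome of roughly 250 million bases becomes a (250000000, 4) array, about 8 GB at int64 but about 1 GB at int8, and each tiny ndarray held in a list adds its own fixed object overhead on top of its data:

import sys
import numpy as np

row = np.zeros(4, dtype=np.int8)  # one encoded base, as in the original list
print(row.nbytes)                 # 4 bytes of actual data
print(sys.getsizeof(row))         # ~100+ bytes including the ndarray header

chunk = np.zeros((200, 4), dtype=np.int8)  # one 200-base chunk
print(chunk.nbytes)                        # 800 bytes of data
print(sys.getsizeof(chunk))                # ~900 bytes: same fixed header, amortized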
user7138814