I have a system with 32 GB of memory, with most of it available to the work I'm trying to do:
$ more /proc/meminfo
MemFree:        29535136 kB
MemAvailable:   30789956 kB
...
I have some code that encodes the letters in a sequence string as vectors:
#!/usr/bin/env python

import os
import sys
import numpy as np
from Bio import SeqIO
import errno
import gzip
import shutil

seq_encoding = {'A': [1, 0, 0, 0],
                'C': [0, 1, 0, 0],
                'G': [0, 0, 1, 0],
                'T': [0, 0, 0, 1],
                'N': [0, 0, 0, 0]}

sequence_chunk_length = 200

def sequence_split_by_length(seq, n):
    """
    A generator that divides a sequence into chunks of n characters
    and yields each chunk as a list of encoding vectors.
    """
    while seq:
        yield [seq_encoding[base] for base in seq[:n].upper()]
        seq = seq[n:]

def encode_chromosome(name, length):
    # fasta_directory is set earlier in the script (not shown)
    enc_records = []
    fasta_fn = os.path.join(fasta_directory, name + '.fa')
    fasta_fh = open(fasta_fn, "rU")
    for record in SeqIO.parse(fasta_fh, "fasta"):
        for chunk in sequence_split_by_length(str(record.seq), sequence_chunk_length):
            enc_records.extend(np.asarray(chunk))
    fasta_fh.close()
    enc_arr = np.asarray(enc_records)
    # ... some more code not relevant to the exception ...
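To make the intermediate structures concrete, here is a toy run of the generator (the input string below is made up for illustration):

for chunk in sequence_split_by_length("ACGTN" * 2, 5):
    print(chunk)
# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
# (printed twice: the ten-character string splits into two chunks of five)

Each chunk is a list of up-to-n four-element lists, and enc_records.extend(np.asarray(chunk)) then appends the rows of the resulting (n, 4) array to enc_records one base at a time.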
Encoding fails at the line:
enc_arr = np.asarray(enc_records)
Here is the relevant part of the traceback:
Traceback (most recent call last):
  File "./encode_sequences.py", line 95, in <module>
    res = encode_chromosome(chromosome_name, sequence_chunk_length)
  File "./encode_sequences.py", line 78, in encode_chromosome
    enc_arr = np.asarray(enc_records)
  ...
MemoryError
The encoded data structure should be about 1 GB in size, which would seem to fit comfortably within the free memory available on this system.
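For reference, the per-entry cost of the interim list can be probed directly (a small check; the exact byte counts are platform-dependent, and the figures in the comments are from a typical 64-bit build):

import sys
import numpy as np

# Each entry of enc_records is a small length-4 ndarray: 32 bytes of
# int64 payload plus the ndarray object's own overhead, plus the
# 8-byte pointer the list holds for it.
row = np.asarray([1, 0, 0, 0])
print(row.nbytes)          # 32 -- payload only
print(sys.getsizeof(row))  # ~128 -- payload plus object overhead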
Is there an alternative method or procedure for converting a Python list to a NumPy array that gets around MemoryError exceptions raised by NumPy methods like asarray()?
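For example, would preallocating the output and filling it in place sidestep the problem? A sketch of the kind of alternative I mean (the helper name and the uint8 dtype are my own choices here, not anything from the script above):

import numpy as np

# Sketch: preallocate the final array and fill it row by row, so no
# interim Python list is built. `seq` stands for one full chromosome
# string; uint8 is an assumption, chosen because the encoding only
# ever holds 0s and 1s.
def encode_in_place(seq):
    enc_arr = np.zeros((len(seq), 4), dtype=np.uint8)
    for i, base in enumerate(seq.upper()):
        enc_arr[i] = seq_encoding[base]
    return enc_arr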