I am looking at biological features called Modules, which are made up in turn of features called smCOGs.
I would like to convert a nested list of Module/smCOG data (input) to a nested array (output) whose rows correspond to the processed lists in the nested list. Each list in the input contains the module ID at index 0, followed by the component smCOG IDs, then nans for the smCOGs that are not present. I would like to convert each inner list to an array with the value N at index N-1, where N is the integer ID of a component smCOG, and 0 at all other indexes (or some other filler value like nan).
As an example of my processing, assuming there are a total of 3 modules (rather than 185000ish) made up of 5 possible smCOG objects (rather than 22000ish):
input (nested list)

    [['Module1', 'SMCOG1', 'SMCOG3', 'SMCOG5', np.nan, np.nan],
     ['Module2', 'SMCOG1', 'SMCOG2', 'SMCOG4', np.nan, np.nan],
     ['Module3', 'SMCOG1', 'SMCOG2', 'SMCOG3', 'SMCOG5', np.nan]]

output (nested numpy array, 3 rows x 5 columns)

    [[1, 0, 3, 0, 5],
     [1, 2, 0, 4, 0],
     [1, 2, 3, 0, 5]]
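
For reference, the mapping described above can be sketched on the toy data as follows. This is only an illustration: the helper name `module_to_row` and its `n_smcogs` parameter are made up here, and it assumes every smCOG ID has the form 'SMCOG' followed by an integer.

```python
import numpy as np

def module_to_row(module, n_smcogs):
    """Return a length-n_smcogs int array with N at index N-1 for each smCOG N."""
    row = np.zeros(n_smcogs, dtype=int)
    for obj in module[1:]:           # skip the module ID at index 0
        if isinstance(obj, float):   # np.nan padding marks the end of the smCOGs
            break
        n = int(obj[len('SMCOG'):])  # 'SMCOG3' -> 3
        row[n - 1] = n
    return row

toy_modules = [
    ['Module1', 'SMCOG1', 'SMCOG3', 'SMCOG5', np.nan, np.nan],
    ['Module2', 'SMCOG1', 'SMCOG2', 'SMCOG4', np.nan, np.nan],
    ['Module3', 'SMCOG1', 'SMCOG2', 'SMCOG3', 'SMCOG5', np.nan],
]
toy_output = np.array([module_to_row(m, 5) for m in toy_modules])
print(toy_output)
```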
It currently takes a long time (about two hours on my machine, I believe). Can someone tell me a better way to go about this? I believe the key time sink is the way I am appending arrays to my nested array, as np.stack seems to work faster than np.vstack (described in the comments near the bottom of the code). This in turn suggests that the preceding steps aren't a big drag.
Thanks! Tim
    import time
    import numpy as np

    count = 0
    #initial array to append my other arrays to
    nested_array = np.zeros(22754)
    time_start = time.time()
    #module can be any module in a 185000ish-object list.
    #Each module is made up of SMCOG features ranging from smCOG #1 to 22754.
    for module in modules_list:
        #module = ['Module_ID', smcogA, smcogB, ..., np.nan, np.nan]
        #A and B = ints corresponding to smCOGs included in module.
        #np.nan = one per each smCOG in 22754 possible smCOGs that are not present in the module.
        #np.nan sequence always comes after sequence of smcogs, the sequences are not mixed together
        #JOB - convert module to an array of ints where the smcogs are index-based i.e. smcog A is at array index A-1 instead of module list index 1 (if A is not 1)
        #most of the smcog features are not in a given module, so initialise an individual module to all 0's
        array = np.zeros(22754)
        #make array indexes corresponding to an smCOG feature non-zero.
        for obj in module:
            #If obj is np.nan then all smCOGs for module are found
            if isinstance(obj, float):
                break
            #if it has smCOG in it then it's a module feature and object at corresponding index should be updated
            if 'SMCOG' in obj:
                smcog_number = int(obj[obj.index('SMCOG') + 5 :])
                array[smcog_number-1] = smcog_number
        #I suspect this is my timesink? I need to add my module array as new row to nested array.
        #np.append throws errors when array and nested array have different dimensions (i.e. when I've added a row to nested array)
        #if I do np.stack([nested_array, array], axis = 0) instead of nested_array = np.vstack([nested_array, array]), it all takes 10 seconds. I'm guessing this is due to a combination of me not assigning the stack output to a variable (so it takes less memory) and stack potentially being more efficient? If I do nested_array = np.stack([nested_array, array], axis = 0) I get ValueError: all input arrays must have the same shape
        nested_array = np.vstack([nested_array, array])
        count += 1
        if count % 1000 == 0:
            print (f'done {count} modules in {round(time.time() - time_start, 2)} seconds') #about 1-200 seconds per 1000 arrays
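
A note on the two observations in the comments: when the result of np.stack is not assigned, nested_array never grows, so each call only stacks two 1-D arrays and stays cheap. Once the result is assigned, nested_array becomes 2-D while array is still 1-D, which is why np.stack raises the shape error. The np.vstack version is slow because every call allocates a new array and copies all rows accumulated so far. A rough sketch of that growth, using small made-up sizes rather than the real data:

```python
import numpy as np

n_rows, n_cols = 200, 50  # small, made-up sizes for illustration only
rows = [np.full(n_cols, i, dtype=float) for i in range(n_rows)]

# Pattern 1: grow the array one row at a time, as in the loop above.
grown = np.zeros(n_cols)
copied = 0
for r in rows:
    copied += grown.size + r.size  # vstack copies everything so far plus the new row
    grown = np.vstack([grown, r])

# Pattern 2: collect the rows in a list and build the 2-D array once.
collected = np.array(rows)

print(f'elements copied by repeated vstack: {copied}')
print(f'elements in the final array:        {collected.size}')
```

The copy count for the first pattern grows roughly with the square of the number of rows, while the second pattern copies each row once.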
EDIT
Making a nested list and converting it to an array, as suggested by hpaulj, was much quicker (about 10 seconds). I also ran into memory issues due to my array size, so I set my array dtype to boolean, which uses less memory (this is discussed in point #3 here).
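
To put rough numbers on the memory issue: np.zeros defaults to float64 (8 bytes per element), while a boolean array uses 1 byte per element. A back-of-the-envelope sketch, assuming the approximate sizes quoted above (185000 modules by 22754 smCOGs):

```python
import numpy as np

n_modules, n_smcogs = 185_000, 22_754  # approximate sizes quoted in the question

for dtype in (np.float64, np.bool_):
    # total bytes = number of elements * bytes per element
    n_bytes = n_modules * n_smcogs * np.dtype(dtype).itemsize
    print(f'{np.dtype(dtype).name}: {n_bytes / 1e9:.1f} GB')
```

That works out to roughly 34 GB for float64 against roughly 4 GB for bool.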
Working code:
    count = 0
    nested_list = []
    time_start = time.time()
    for module in modules_list:
        array = np.zeros(22754, dtype=bool)#may help with memory issues
        #make array indexes corresponding to an smCOG feature non-zero.
        for obj in module:
            #If obj is np.nan then all smCOGs for module are found
            if isinstance(obj, float):
                break
            #if it has smCOG in it then its a module feature and object at corresponding index should be updated
            if 'SMCOG' in obj:
                smcog_number = int(obj[obj.index('SMCOG') + 5 :])
                array[smcog_number-1] = 1 #boolean True
        #nested_array = np.vstack([nested_array, array])
        nested_list.append(array)
        count += 1
        if count % 1000 == 0:
            print (f'done {count} modules in {round(time.time() - time_start, 2)} seconds') #about 100 seconds per 1000 arrays
    nested_array = np.array(nested_list)
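
A possible variation, not tested on the full data set: since the number of modules and the number of possible smCOGs are known up front, the 2-D boolean array can be allocated once and filled row by row, which avoids holding the list of row arrays and the final array in memory at the same time. The function name `modules_to_bool_array` is made up for this sketch, and it assumes the same 'SMCOG' plus integer ID format as above.

```python
import numpy as np

def modules_to_bool_array(modules_list, n_smcogs):
    """Fill a preallocated (n_modules, n_smcogs) boolean presence array."""
    out = np.zeros((len(modules_list), n_smcogs), dtype=bool)
    for i, module in enumerate(modules_list):
        for obj in module[1:]:          # skip the module ID at index 0
            if isinstance(obj, float):  # np.nan padding: no more smCOGs
                break
            out[i, int(obj[len('SMCOG'):]) - 1] = True
    return out

toy_modules = [
    ['Module1', 'SMCOG1', 'SMCOG3', 'SMCOG5', np.nan, np.nan],
    ['Module2', 'SMCOG1', 'SMCOG2', 'SMCOG4', np.nan, np.nan],
    ['Module3', 'SMCOG1', 'SMCOG2', 'SMCOG3', 'SMCOG5', np.nan],
]
presence = modules_to_bool_array(toy_modules, 5)

# If the integer layout (N at index N-1) is needed, it can be recovered
# by multiplying each column by its 1-based smCOG number.
as_ints = presence * np.arange(1, 6)
print(as_ints)
```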