
I am looking at biological features called Modules, which are made up in turn of features called smCOGs.

I would like to convert a nested list of Module/smCOG data (input) into a nested array (output) whose rows correspond to the processed inner lists. Each inner list contains the module ID at index 0, followed by the IDs of its component smCOGs, followed by one np.nan for each smCOG that is not present. I would like to convert each inner list to an array holding the value N at index N-1, where N is the integer ID of a component smCOG, and 0 (or some other filler value like nan) at all other indexes.

As an example of my processing, assume there are a total of 3 modules (rather than ~185,000) made up of 5 possible smCOG objects (rather than ~22,000):

input (nested list)

[['Module1', 'SMCOG1', 'SMCOG3', 'SMCOG5', np.nan,   np.nan],
 ['Module2', 'SMCOG1', 'SMCOG2', 'SMCOG4', np.nan,   np.nan],
 ['Module3', 'SMCOG1', 'SMCOG2', 'SMCOG3', 'SMCOG5', np.nan]]

output (nested numpy array, 3 rows x 5 columns)

[[1, 0, 3, 0, 5],
 [1, 2, 0, 4, 0],
 [1, 2, 3, 0, 5]]
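For reference, the conversion on this toy input can be reproduced with a short sketch (the smCOG number is parsed off the end of each `'SMCOG<number>'` string, as in my code below):

```python
import numpy as np

# Toy version of the conversion: 3 modules, 5 possible smCOGs.
modules = [['Module1', 'SMCOG1', 'SMCOG3', 'SMCOG5', np.nan, np.nan],
           ['Module2', 'SMCOG1', 'SMCOG2', 'SMCOG4', np.nan, np.nan],
           ['Module3', 'SMCOG1', 'SMCOG2', 'SMCOG3', 'SMCOG5', np.nan]]
n_smcogs = 5

rows = []
for module in modules:
    row = np.zeros(n_smcogs, dtype=int)
    for obj in module[1:]:          # skip the module ID at index 0
        if isinstance(obj, float):  # nan padding: no more smCOGs in this module
            break
        n = int(obj[5:])            # 'SMCOG3' -> 3
        row[n - 1] = n              # value N goes at index N-1
    rows.append(row)

nested_array = np.array(rows)       # 3 x 5 array matching the output above
```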

It's currently taking a long time (about two hours on my machine, I believe). Can someone tell me a better way to go about this? I believe the key time sink is the way I am appending arrays to my nested array, as np.stack seems to work faster than np.vstack (described at the bottom of the comments in the code below). This in turn suggests that the preceding functions aren't a big drag.

Thanks! Tim

count = 0

#initial array to append my other arrays to
nested_array = np.zeros(22754)
time_start = time.time()

#module can be any module in a 185000ish-object list.  
#Each module is made up of SMCOG features ranging from smCOG #1 to 22754.  
for module in modules_list:
    #module = ['Module_ID', smcogA, smcogB, ..., np.nan, np.nan]
    #A and B = ints corresponding to smCOGs included in module.
    #np.nan = one for each of the 22754 possible smCOGs that is not present in the module.
    #The np.nan sequence always comes after the sequence of smcogs; the sequences are not mixed together
    
    #JOB - convert module to an array of ints where the smcogs are index-based, i.e. smcog A is stored at array index A-1 instead of at its position in the module list

    #most of the smcog features are not in a given module, so initialise an individual module to all 0's
    array = np.zeros(22754)
    
    #make array indexes corresponding to an smCOG feature non-zero.
    for obj in module:
        
        #If obj is np.nan then all smCOGs for module are found
        if isinstance(obj, float):
            break
        #if it has smCOG in it then its a module feature and object at corresponding index should be updated
        if 'SMCOG' in obj:
            smcog_number = int(obj[obj.index('SMCOG') + 5 :])
            array[smcog_number-1] = smcog_number
            
    #I suspect this is my timesink?  I need to add my module array as new row to nested array.  
    #np.append throws errors when array and nested array have different dimensions (i.e. when I've added a row to nested array)
    #if I do np.stack([nested_array, array], axis=0) instead of nested_array = np.vstack([nested_array, array]), it all takes 10 seconds.  I'm guessing this is due to a combination of me not assigning the stack output to a variable (so it takes less memory) and stack potentially being more efficient?  If I do nested_array = np.stack([nested_array, array], axis=0) I get ValueError: all input arrays must have the same shape
    nested_array = np.vstack([nested_array, array])
    
    count += 1
    if count % 1000 == 0:
        print (f'done {count} modules in {round(time.time() - time_start, 2)} seconds') #about 1-200 seconds per 1000 arrays
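To illustrate the suspected time sink: each np.vstack call allocates a fresh array and copies every row accumulated so far, so n appends do O(n**2) row copies in total. A minimal sketch contrasting the two growth strategies (shapes only, no timings):

```python
import numpy as np

# Growing with vstack: every iteration reallocates and re-copies all rows so far.
grown = np.zeros((0, 5))
for _ in range(3):
    grown = np.vstack([grown, np.ones(5)])   # new allocation + full copy each time

# Collecting in a list instead: append is O(1), and the single np.array call
# at the end copies each row exactly once.
collected = []
for _ in range(3):
    collected.append(np.ones(5))
stacked = np.array(collected)

assert np.array_equal(grown, stacked)        # same result, very different cost
```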

EDIT

Making a nested list and converting it to an array, as suggested by hpaulj, was much quicker (about 10 seconds). I also ran into memory issues due to my array size, so I additionally set the array dtype to boolean, which uses less memory (this is discussed in point #3 here).

Working code:

count = 0
nested_list = []
time_start = time.time()
  
for module in modules_list:
    array = np.zeros(22754, dtype=bool)  #may help with memory issues
    
    #make array indexes corresponding to an smCOG feature non-zero.
    for obj in module:
        
        #If obj is np.nan then all smCOGs for module are found
        if isinstance(obj, float):
            break
        #if it has smCOG in it then its a module feature and object at corresponding index should be updated
        if 'SMCOG' in obj:
            smcog_number = int(obj[obj.index('SMCOG') + 5 :])
            array[smcog_number-1] = 1 #boolean True
            
    #nested_array = np.vstack([nested_array, array])
    nested_list.append(array)

    count += 1
    if count % 1000 == 0:
        print (f'done {count} modules in {round(time.time() - time_start, 2)} seconds') #about 100 seconds per 1000 arrays
        
nested_array = np.array(nested_list)
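If the per-module Python loop ever becomes a bottleneck too, the index arithmetic can be batched into a single fancy-indexing write. A sketch under the same assumptions as above ('SMCOG<number>' naming, nan padding after the smCOGs; the toy sizes are placeholders):

```python
import numpy as np

modules_list = [['Module1', 'SMCOG1', 'SMCOG3', np.nan],
                ['Module2', 'SMCOG2', 'SMCOG3', np.nan]]
n_smcogs = 3

# Collect (row, column) pairs for every smCOG present, then set them all at once.
row_idx, col_idx = [], []
for i, module in enumerate(modules_list):
    for obj in module[1:]:           # skip the module ID
        if isinstance(obj, float):   # nan padding marks the end of the smCOGs
            break
        row_idx.append(i)
        col_idx.append(int(obj[5:]) - 1)

nested_array = np.zeros((len(modules_list), n_smcogs), dtype=bool)
nested_array[row_idx, col_idx] = True   # one vectorised assignment
```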
  • I haven't read your code in detail, but I see you are using `vstack` repeatedly in a loop. Try not to do this! `vstack`, `stack` (and `np.append`) all use `np.concatenate`, and are best used with a big list of arrays. They make a new array, with all the required copying. Use the list `append` method to collect the arrays in a list, and use the appropriate form of `concatenate` just once, to join them all into one array. – hpaulj Jan 14 '22 at 07:24
  • Excellent, thanks very much - I just made a nested list, then called np.array(nested_list_of_arrays) on that and it takes 10 seconds (see edit for anyone with a similar issue). Can you make your comment an answer @hpaulj? – Tim Kirkwood Jan 14 '22 at 08:13
