
I've got a function that takes several parameters as input and looks up the matching cell in a reference table (using the `.loc` function). This function is part of a larger one, but when I profiled my code I realised 99% of the time was spent on the `.loc` lookup, and I don't know if it's possible to speed this up.

The reference table on which the `.loc` is performed is about 500k rows. Some columns contain strings, others contain floats.
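
For reference, a minimal sketch of the setup (column names taken from the profiled code below; the data itself is invented for illustration):

import pandas as pd

# Toy stand-in for the ~500k-row reference table (the real one comes from a GTF file).
# Note that exon_number holds strings, which is why the lookup compares against str(exon).
gtf_test = pd.DataFrame({
    'transcript_id': ['ENST0001', 'ENST0001', 'ENST0002'],
    'exon_number': ['1', '2', '1'],
    'Start': [1000.0, 2500.0, 4000.0],
})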

Here's the profiled code:

Timer unit: 1e-06 s

Total time: 0.041261 s
File: <ipython-input-106-62a9b3c7d0c0>
Function: convert_position at line 38

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    38                                           def convert_position(transcript, exon, delta, genome=gtf_test):
    39                                           
    40                                           
    41         1      41259.0  41259.0    100.0      start = genome.loc[(genome['transcript_id'].values == transcript) & (genome['exon_number'].values == str(exon)), 'Start'].item()
    42                                               
    43         1          2.0      2.0      0.0      position = start + delta
    44                                               
    45         1          0.0      0.0      0.0      return position

From what I could find, it looks like this is the fastest I can get using `.loc`, but maybe there's an alternative that doesn't rely on `.loc` and would be even faster?

The main function (which calls this one) is applied to a dataframe column with 8M rows, so even a small decrease in computing time adds up to a lot of saved time.

Thanks in advance for your help!

Florian Bernard
  • Maybe separate that out to three different *searches* so that each successive *search* is looking at a smaller dataset: `x=genome[genome['transcript_id'].values == transcript]; x = x[x['exon_number'].values == str(exon)]; x.loc[:,'Start'].item()`. – wwii Nov 24 '19 at 17:22
  • Why are you using `.values` for your comparisons? – wwii Nov 24 '19 at 17:23
  • Maybe try a query: https://stackoverflow.com/a/46324152/2823755 – wwii Nov 24 '19 at 17:28
  • @wwii I will try the first idea! As for why I'm using `.values`, it's because it's faster since it's a numpy array. – Florian Bernard Nov 24 '19 at 18:37
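
For completeness, here is a sketch of the two alternatives suggested in the comments above (untested on the real table, so the actual speedup is unverified; `genome`, `transcript` and `exon` as in the question):

# 1. Successive filtering: the first comparison narrows the rows,
#    so the second comparison runs on a much smaller array.
x = genome[genome['transcript_id'].values == transcript]
start = x.loc[x['exon_number'].values == str(exon), 'Start'].item()

# 2. df.query, which can evaluate the boolean expression with numexpr
#    (see the linked answer).
exon_str = str(exon)
start = genome.query('transcript_id == @transcript and exon_number == @exon_str')['Start'].item()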

1 Answer


I solved the problem by replacing my dataframe with a dictionary. I fused both columns (`transcript_id` and `exon_number`) into a single key and used it to construct the dictionary. I can then combine transcript and exon into a string to look up my `Start` value.
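
One way to build that dictionary, done once up front (assuming the `gtf_test` table and column names from the question):

# Build the lookup dict once, outside the hot path:
# key '<transcript_id>.e<exon_number>' -> Start value.
gtf = {
    f'{t}.e{e}': s
    for t, e, s in zip(gtf_test['transcript_id'], gtf_test['exon_number'], gtf_test['Start'])
}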

It now looks like this:

def convert_position(transcript, exon, delta, genome=gtf):
    # A single dict lookup on the fused key replaces the .loc scan.
    start = genome[f'{transcript}.e{exon}']
    position = start + delta
    return position

Here's the profiled code:

Timer unit: 1e-06 s

Total time: 6e-06 s
File: <ipython-input-132-e1495f24a68e>
Function: convert_position at line 41

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    41                                           def convert_position(transcript, exon, delta, genome=gtf):
    42                                               
    43         1          5.0      5.0     83.3      start = genome[f'{transcript}.e{exon}']
    44                                               
    45         1          1.0      1.0     16.7      position = start + delta
    46                                               
    47         1          0.0      0.0      0.0      return position
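
Since the main function is applied to an 8M-row column, one further possibility is to skip the per-row Python call entirely and vectorise the key construction; a sketch, assuming hypothetical column names `transcript`, `exon` and `delta` on the big dataframe `df`:

# Build all fused keys in one vectorised pass, then map them through the
# dict instead of making 8M individual function calls.
keys = df['transcript'].astype(str) + '.e' + df['exon'].astype(str)
df['position'] = keys.map(gtf) + df['delta']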
Florian Bernard