How to speed up this Pandas for loop

Question

I have the following data in Python:

list1=[[ENS_ID1,ENS_ID2,ENS_ID3], [ENS_ID10,ENS_ID24,ENS_ID30] , ....]

mapping (a DataFrame whose first column holds an Ensembl gene ID and whose second column holds the corresponding MGI gene ID)

ENS_ID	MGI_ID
ENS_ID1	MGI_ID1
ENS_ID2	MGI_ID2

I'm trying to obtain another list of lists in which each ENS_ID is replaced by its MGI_ID. To map the IDs I'm using a for loop nested inside another one, but this approach is obviously very slow. How can I speed it up? Here's the code:

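mgi_lists = []  # one list of MGI IDs per input list
for l in ens_lists:
    mgi = []
    for i in l:
        # the boolean mask scans the whole ENSEMBL_ID column for every single ID
        mgi.append(mapping['MGI_ID'][mapping[mapping['ENSEMBL_ID']==i].index].values[0])
    mgi_lists.append(mgi)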

Loops in Python are very slow. You might look up `multithreading` to speed up performance. – surftijmen, Oct 22 '21 at 07:42
I was wondering whether there's a different way to do it without loops – Andrea, Oct 22 '21 at 07:43
Could you please elaborate a little more on the structure of `mapping`? I thought `ENS_ID` and `MGI_ID` were simple constants, but the line `mapping['MGI_ID'][mapping[mapping['ENSEMBL_ID']==i].index].values[0]` makes me think the structure is more complex than a simple dict... – NiziL, Oct 22 '21 at 07:51
mapping is a DataFrame where the first column holds an Ensembl gene ID and the second column holds the corresponding MGI gene ID – Andrea, Oct 22 '21 at 07:54
@surftijmen Multithreading won't help with pure Python code due to the GIL. – AKX, Oct 22 '21 at 07:55

score 0 · Answer 1 · answered Oct 22 '21 at 07:53

As a quick solution, you can try a list comprehension instead of repeated append calls, which should be somewhat faster:

mgi_lists = [[mapping['MGI_ID'][mapping[mapping['ENSEMBL_ID']==i].index].values[0] for i in l] for l in ens_lists]

Some explanations of why listcomp is faster are here

abhijat


This will probably not be appreciably faster. – AKX Oct 22 '21 at 07:55
Thank you for this first solution, but unfortunately, as AKX said, the difference is not noticeable – Andrea Oct 22 '21 at 07:59
One other small optimisation you could do is extract `mapping['MGI_ID']` out of the loop as it does not seem to change with the loop, so you wouldn't need to pay the cost of dict lookup per iteration, but again the speedup would probably be quite small. – abhijat Oct 22 '21 at 08:03
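
The hoisting idea in the last comment can be taken one step further: rather than only pulling `mapping['MGI_ID']` out of the loop, the ID column can be turned into an index once, so that no boolean mask over the whole column is needed per ID. The following is a minimal sketch with made-up sample data, assuming every Ensembl ID occurs exactly once in `mapping` and that every requested ID is present (otherwise `.loc` raises a `KeyError`). The name `mgi_by_ens` is hypothetical.

```python
import pandas as pd

# Made-up sample data using the column names from the question's code
mapping = pd.DataFrame({
    "ENSEMBL_ID": ["ENS_ID1", "ENS_ID2", "ENS_ID3", "ENS_ID10"],
    "MGI_ID": ["MGI_ID1", "MGI_ID2", "MGI_ID3", "MGI_ID10"],
})
ens_lists = [["ENS_ID1", "ENS_ID2", "ENS_ID3"], ["ENS_ID10", "ENS_ID3", "ENS_ID2"]]

# Build the index once, outside the loops (assumes unique Ensembl IDs)
mgi_by_ens = mapping.set_index("ENSEMBL_ID")["MGI_ID"]

# .loc with a list of labels returns the values in the requested order
mgi_lists = [mgi_by_ens.loc[ens_list].tolist() for ens_list in ens_lists]

print(mgi_lists)
# [['MGI_ID1', 'MGI_ID2', 'MGI_ID3'], ['MGI_ID10', 'MGI_ID3', 'MGI_ID2']]
```

This keeps everything in pandas, at the cost of one label-based lookup per inner list.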

score 0 · Accepted Answer · answered Oct 22 '21 at 14:26

The best solution is to build a fast lookup structure that contains only the key/value pairs; a dict works well because each lookup takes constant time on average. After that, walk over the input lists and build the mapped version.

import pandas as pd

list1=[['ENS_ID1','ENS_ID2','ENS_ID3'], ['ENS_ID10','ENS_ID3','ENS_ID2'] ] 

mapping = pd.DataFrame({'ENS_ID':['ENS_ID1','ENS_ID2','ENS_ID3','ENS_ID10'], 'MGI_ID':['MGI_ID1','MGI_ID2','MGI_ID2','MGI_ID10']})
    
lookup = dict(mapping[['ENS_ID','MGI_ID']].values)

# This is superfast
mapped_list = []
for l in list1:
    mapped_list.append([lookup[v] for v in l])

print(mapped_list)
# [['MGI_ID1', 'MGI_ID2', 'MGI_ID2'], ['MGI_ID10', 'MGI_ID2', 'MGI_ID2']]
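
If some Ensembl IDs might be missing from `mapping`, `lookup[v]` raises a `KeyError`. A possible variant, sketched here under the assumption that a missing ID should simply become `None`, uses `dict.get` instead. The ID `ENS_ID99` and the names `ens_lists` and `mapped_lists` are made up for illustration.

```python
import pandas as pd

mapping = pd.DataFrame({
    "ENS_ID": ["ENS_ID1", "ENS_ID2", "ENS_ID3", "ENS_ID10"],
    "MGI_ID": ["MGI_ID1", "MGI_ID2", "MGI_ID2", "MGI_ID10"],
})
# 'ENS_ID99' is deliberately absent from the mapping
ens_lists = [["ENS_ID1", "ENS_ID2", "ENS_ID3"], ["ENS_ID10", "ENS_ID99"]]

# Pair the two columns element-wise to build the lookup dict
lookup = dict(zip(mapping["ENS_ID"], mapping["MGI_ID"]))

# dict.get returns None for an unknown key instead of raising KeyError
mapped_lists = [[lookup.get(ens_id) for ens_id in ens_list] for ens_list in ens_lists]

print(mapped_lists)
# [['MGI_ID1', 'MGI_ID2', 'MGI_ID2'], ['MGI_ID10', None]]
```

Whether to return `None`, skip the ID, or fail loudly depends on how unmapped genes should be treated downstream.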

ps: please correct the question with working code.
