I have the following data in Python:

list1=[[ENS_ID1,ENS_ID2,ENS_ID3], [ENS_ID10,ENS_ID24,ENS_ID30] , ....] 

mapping (a dataframe where the first column holds an Ensembl gene ID and the second column holds the corresponding MGI gene ID)

ENS_ID MGI_ID
ENS_ID1 MGI_ID1
ENS_ID2 MGI_ID2

I'm trying to build another list of lists in which each ENS_ID is replaced by its MGI_ID. To map the IDs I'm using two nested for loops, but this approach is really slow. How can I speed it up? Here's the code:

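mgi_lists = []  # result: one list of MGI IDs per input list
for l in ens_lists:
  mgi = []
  for i in l:
      # filters the whole ENSEMBL_ID column again for every single ID
      mgi.append(mapping['MGI_ID'][mapping[mapping['ENSEMBL_ID']==i].index].values[0])
  mgi_lists.append(mgi)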
ᴀʀᴍᴀɴ
Andrea
  • No idea if it's quicker, but can you sort them and zip? – Sayse Oct 22 '21 at 07:42
  • Loops in Python are very slow. You might look up `multithreading` to speed up performance. – surftijmen Oct 22 '21 at 07:42
  • I was thinking if there's a different way to do it without loops – Andrea Oct 22 '21 at 07:43
  • Could you please elaborate a little more on the structure of `mapping`? I thought `ENS_ID` and `MGI_ID` were simple constants, but the line `mapping['MGI_ID'][mapping[mapping['ENSEMBL_ID']==i].index].values[0]` makes me think the structure is more complex than a simple dict... – NiziL Oct 22 '21 at 07:51
  • mapping is a dataframe where the first column holds an Ensembl gene ID and the second column holds the corresponding MGI gene ID – Andrea Oct 22 '21 at 07:54
  • @surftijmen Multithreading won't help with pure Python code due to the GIL. – AKX Oct 22 '21 at 07:55

2 Answers

As a quick solution, you can try a list comprehension instead of repeated append calls, which should be faster:

mgi_lists = [[mapping['MGI_ID'][mapping[mapping['ENSEMBL_ID']==i].index].values[0] for i in l] for l in ens_lists]

Some explanations of why listcomp is faster are here

abhijat
  • This will probably not be appreciably faster. – AKX Oct 22 '21 at 07:55
  • Thank you for this first solution, but unfortunately, as AKX said, the difference is not noticeable – Andrea Oct 22 '21 at 07:59
  • One other small optimisation you could do is extract `mapping['MGI_ID']` out of the loop as it does not seem to change with the loop, so you wouldn't need to pay the cost of dict lookup per iteration, but again the speedup would probably be quite small. – abhijat Oct 22 '21 at 08:03

The best solution is to build a fast lookup structure that holds only the values you need, that is, key/value pairs; a dict lookup is very fast. After that, walk over the inputs and build the mapped version.

import pandas as pd

list1=[['ENS_ID1','ENS_ID2','ENS_ID3'], ['ENS_ID10','ENS_ID3','ENS_ID2'] ] 

mapping = pd.DataFrame({'ENS_ID':['ENS_ID1','ENS_ID2','ENS_ID3','ENS_ID10'], 'MGI_ID':['MGI_ID1','MGI_ID2','MGI_ID2','MGI_ID10']})
    
lookup = dict(mapping[['ENS_ID','MGI_ID']].values)

# This is superfast
mapped_list = []
for l in list1:
    mapped_list.append([lookup[v] for v in l])

print(mapped_list)
# [['MGI_ID1', 'MGI_ID2', 'MGI_ID2'], ['MGI_ID10', 'MGI_ID2', 'MGI_ID2']]
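
If some Ensembl IDs might be missing from the mapping, `lookup[v]` raises a KeyError. Below is a minimal sketch of one way to handle that, not the only one. The names `ens_ids`, `mgi_ids`, `ens_lists` and the ID `ENS_ID999` are made up for illustration, and plain lists stand in for the two dataframe columns so the snippet runs without pandas; with the dataframe above you could pass `mapping['ENS_ID']` and `mapping['MGI_ID']` to `zip` instead.

```python
# Toy stand-ins for the two dataframe columns (assumed data, for illustration only).
ens_ids = ["ENS_ID1", "ENS_ID2", "ENS_ID3", "ENS_ID10"]
mgi_ids = ["MGI_ID1", "MGI_ID2", "MGI_ID3", "MGI_ID10"]

# Hypothetical input; ENS_ID999 is deliberately absent from the mapping.
ens_lists = [["ENS_ID1", "ENS_ID2", "ENS_ID3"], ["ENS_ID10", "ENS_ID999"]]

# Build the lookup once; each later access is an average O(1) hash lookup
# instead of a scan over the whole column.
lookup = dict(zip(ens_ids, mgi_ids))

# dict.get returns None for an unmapped ID instead of raising KeyError.
mgi_lists = [[lookup.get(ens_id) for ens_id in sub] for sub in ens_lists]

print(mgi_lists)
# [['MGI_ID1', 'MGI_ID2', 'MGI_ID3'], ['MGI_ID10', None]]
```

Note that if the same Ensembl ID appears more than once in the mapping, the dict keeps only the last MGI ID seen for it.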

PS: Please update the question with working code.

Glauco