I am new to Python and have written the following code, which runs very slowly.

I debugged the code and found that the last re.match() is what makes it slow. The previous match does the same kind of match against the same DataFrame, yet it returns quickly.

Here is the code:

My_Cells = pd.read_csv('SomeFile',index_col = 'Gene/Cell Line(row)').T
My_Cells_Others = pd.DataFrame(index=My_Cells.index,columns=[col for col in My_Cells if re.match('.*\sCN$|.*\sMUT$|^bladder$|^blood$|^bone$|^breast$|^CNS$|^GI tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$',col)])
My_Cells_Genes = pd.DataFrame(index=My_Cells.index,columns=[col for col in My_Cells if re.match('.*\sCN$|.*\sMUT$|^bladder$|^blood$|^bone$|^breast$|^CNS$|^GI tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$',col) is None ])
for col in My_Cells.columns:
   if  re.match('.*\sCN$|.*\sMUT$|^bladder$|^blood$|^bone$|^breast$|^CNS$|^GI tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$',col):
          My_Cells_Others [col] = pd.DataFrame(My_Cells[col])
   if  re.match('.*\sCN$|.*\sMUT$|^bladder$|^blood$|^bone$|^breast$|^CNS$|^GI tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$',col) is None:
          My_Cells_Genes [col] =  pd.DataFrame(My_Cells[col])

I do not think the problem is related to regular expressions. The code below, which uses no regex at all, still runs slowly.

for col in My_Cells_Others.columns:
    if (col in lst) or col.endswith(' CN') or col.endswith(' MUT'):
          My_Cells_Others [col] = My_Cells[col]
for col in My_Cells_Genes.columns:
    if  not ((col in lst) or col.endswith(' CN') or col.endswith(' MUT')):
        My_Cells_Genes [col] =  My_Cells[col]
  • What about `if col.endswith('CN') or col.endswith('MUT') or col in ['bladder','blood','bone',...]:` – jedwards Apr 26 '15 at 10:45
  • You could compile the regex like this `p = re.compile(ur'.*\sCN$|.*\sMUT$|^bladder$|^blood$|^bone$|^breast$|^CNS$|^GI tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$')` *outside* the loop. Then, use as `if (p.match(col))`... – Wiktor Stribiżew Apr 26 '15 at 11:21
  • Specifically, the 2nd for loop above. The data frame is large, about 14000 columns, but I am not sure if this is the reason – user1050702 Apr 27 '15 at 04:28
  • @user1050702 What are the actual times you get? – user Apr 27 '15 at 08:23
  • It takes at least 15 minutes to iterate over both data frames, but it finishes eventually :) – user1050702 Apr 27 '15 at 08:29
  • @user1050702 What are the actual times for both? Is it for example 12 minutes vs 15? – user Apr 27 '15 at 09:02
  • If I do not construct My_Cells_Genes, it takes no time. Only when I try to generate My_Cells_Genes does it take ~15 minutes. The number of columns extracted for My_Cells_Others is much smaller than for My_Cells_Genes. I hope that answers your question. – user1050702 Apr 27 '15 at 19:00
  • Possible duplicate of [Why does this take so long to match? Is it a bug?](http://stackoverflow.com/questions/25982466/why-does-this-take-so-long-to-match-is-it-a-bug) – user Jan 21 '16 at 12:09

1 Answer

"Poorly" designed regular expressions can be unnecessarily slow.

My guess is that .*\sCN$ and .*\sMUT$, combined with a long string that does not match, are what make it slow: the leading .* first consumes the whole string, and the engine then has to backtrack through it before it can give up.


As @jedwards said, you can replace this piece of code

if  re.match('.*\sCN$|.*\sMUT$|^bladder$|^blood$|^bone$|^breast$|^CNS$|^GI tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$',col):
          My_Cells_Others [col] = pd.DataFrame(My_Cells[col])

with:

lst = ['bladder', 'blood', 'bone', 'breast', 'CNS', 'GI tract', 'kidney', 'lung', 'other', 'ovary', 'pancreas', 'skin',
       'soft tissue', 'thyroid', 'upper aerodigestive', 'uterus']

if (col in lst) or col.endswith(' CN') or col.endswith(' MUT'):
    # Do stuff
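
As a minimal sketch of that check applied to a whole list of column names: the column names below are made up for illustration, and `is_other_column` is a hypothetical helper, not something from the question.

```python
# Tissue names that must match exactly (same names as in lst above),
# stored in a set so that membership tests are fast.
TISSUES = {
    'bladder', 'blood', 'bone', 'breast', 'CNS', 'GI tract', 'kidney',
    'lung', 'other', 'ovary', 'pancreas', 'skin', 'soft tissue',
    'thyroid', 'upper aerodigestive', 'uterus',
}


def is_other_column(col):
    """Return True for tissue columns and for names ending in ' CN' or ' MUT'."""
    # str.endswith accepts a tuple of suffixes, so one call covers both.
    return col in TISSUES or col.endswith((' CN', ' MUT'))


# Hypothetical column names, for illustration only.
columns = ['TP53 MUT', 'EGFR CN', 'lung', 'BRCA1', 'skin', 'KRAS']

# Partition the names once, instead of testing inside two separate loops.
other_cols = [c for c in columns if is_other_column(c)]
gene_cols = [c for c in columns if not is_other_column(c)]

print(other_cols)
print(gene_cols)
```

With the two name lists in hand, it may also be worth selecting each group in one step, e.g. `My_Cells[other_cols]`, instead of assigning columns one at a time. Whether that helps depends on where the time is actually going.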

Alternatively, if you want to use re for some reason, moving .*\sCN$ and .*\sMUT$ to the end of the regex might help, depending on your data, since the engine will then try those alternatives only when none of the exact names match.
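
A rough sketch of that reordering, with the pattern compiled once outside any loop as suggested in the comments. The sample column names are invented for illustration, and I have not measured how much difference this makes on the real data.

```python
import re

# Exact tissue names come first; the '.*' alternatives come last.
# Raw strings keep '\s' from being treated as a string escape.
PATTERN = re.compile(
    r'^(?:bladder|blood|bone|breast|CNS|GI tract|kidney|lung|other|ovary|'
    r'pancreas|skin|soft tissue|thyroid|upper aerodigestive|uterus)$'
    r'|.*\sCN$|.*\sMUT$'
)

# Hypothetical column names, for illustration only.
columns = ['TP53 MUT', 'EGFR CN', 'lung', 'BRCA1', 'lung cancer']

# Reuse the compiled pattern for every column name.
matched = [c for c in columns if PATTERN.match(c)]
print(matched)
```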
