I am new to Python and have written the following code, which runs very slowly.
After some debugging, it looks like the last re.match() is what slows the code down,
even though the previous match does the same kind of match against the same DataFrame and comes back quickly.
Here is the code:
import re
import pandas as pd

# Columns whose names match this pattern go to My_Cells_Others; all other columns go to My_Cells_Genes.
pattern = r'.*\sCN$|.*\sMUT$|^bladder$|^blood$|^bone$|^breast$|^CNS$|^GI tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$'

My_Cells = pd.read_csv('SomeFile', index_col='Gene/Cell Line(row)').T
My_Cells_Others = pd.DataFrame(index=My_Cells.index, columns=[col for col in My_Cells if re.match(pattern, col)])
My_Cells_Genes = pd.DataFrame(index=My_Cells.index, columns=[col for col in My_Cells if re.match(pattern, col) is None])
for col in My_Cells.columns:
    if re.match(pattern, col):
        My_Cells_Others[col] = pd.DataFrame(My_Cells[col])
    if re.match(pattern, col) is None:
        My_Cells_Genes[col] = pd.DataFrame(My_Cells[col])
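To separate the cost of the matching itself from the cost of the column assignments, the pattern can be timed on its own. This is only a rough sketch: the column names are made up (the real ones come from 'SomeFile'), and `PATTERN`, `columns` and `split_columns` are names introduced here for illustration.

```python
import re
import timeit

# Same alternation as in the code above, compiled once.
PATTERN = re.compile(
    r'.*\sCN$|.*\sMUT$|^bladder$|^blood$|^bone$|^breast$|^CNS$|^GI tract$'
    r'|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$'
    r'|^thyroid$|^upper aerodigestive$|^uterus$'
)

# Hypothetical column names; the real ones come from 'SomeFile'.
columns = ['lung', 'TP53 MUT', 'TP53 CN', 'TP53', 'soft tissue', 'BRCA1'] * 1000


def split_columns(cols):
    """Return (matching, non_matching) names, matching each name only once."""
    others, genes = [], []
    for col in cols:
        (others if PATTERN.match(col) else genes).append(col)
    return others, genes


others, genes = split_columns(columns)
elapsed = timeit.timeit(lambda: split_columns(columns), number=10)
print(f'{len(others)} matched, {len(genes)} unmatched, {elapsed:.4f} s for 10 passes')
```

If this part finishes quickly on the real column names, the time is more likely going into the per-column assignments than into re.match().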
I do not think the problem is related to regular expressions. The code below uses plain string checks instead (lst holds the names that the pattern matches exactly) and still runs slowly.
for col in My_Cells_Others.columns:
    if (col in lst) or col.endswith(' CN') or col.endswith(' MUT'):
        My_Cells_Others[col] = My_Cells[col]
for col in My_Cells_Genes.columns:
    if not ((col in lst) or col.endswith(' CN') or col.endswith(' MUT')):
        My_Cells_Genes[col] = My_Cells[col]
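For comparison, here is a sketch of the same split done with a single boolean mask over the column names instead of one assignment per column. It is not a confirmed fix. The small `My_Cells` frame is a hypothetical stand-in for the real data, and the contents of `lst` are assumed to be the names the regex matches exactly.

```python
import pandas as pd

# Hypothetical stand-in for My_Cells; the real frame comes from 'SomeFile'.
My_Cells = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    index=['cell_a', 'cell_b'],
    columns=['lung', 'TP53 MUT', 'TP53', 'BRCA1 CN'],
)

# Assumed contents of lst: the names matched exactly by the regex above.
lst = ['bladder', 'blood', 'bone', 'breast', 'CNS', 'GI tract', 'kidney',
       'lung', 'other', 'ovary', 'pancreas', 'skin', 'soft tissue',
       'thyroid', 'upper aerodigestive', 'uterus']

cols = My_Cells.columns
# Boolean mask: True for columns that belong in My_Cells_Others.
is_other = cols.isin(lst) | cols.str.endswith(' CN') | cols.str.endswith(' MUT')

# Select each group of columns in one step, without pre-building empty frames.
My_Cells_Others = My_Cells.loc[:, is_other]
My_Cells_Genes = My_Cells.loc[:, ~is_other]
```

Like the loops above, this checks for a literal space before CN or MUT, whereas the regex version accepts any whitespace character there.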