I'm trying my hand at the Kaggle COVID-19 competition (https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks) just to see if I can help. I have a question about improving the efficiency of regular expression search in a pandas DataFrame.
I've organised the dataset so that I have a DataFrame with the title, abstract, and full text of each article in a row. The goal is to search the full text for keywords using a regular expression and return the matches as a set in a new column. As a first step, I am searching for the viruses mentioned in each article. I am also using a dataset from the International Committee on Taxonomy of Viruses to help me identify virus names (https://talk.ictvonline.org/files/master-species-lists/m/msl/8266).
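To make the goal concrete, here is a minimal, self-contained sketch of the kind of pattern-to-column step I mean (the pattern and names here are illustrative only, not my real ones):

import pandas as pd

demo = pd.DataFrame({"Text": ["coronavirus and paleovirus text",
                              "nothing relevant here"]})
# Series.str.findall returns a list of matches per row; set() removes duplicates
demo["viruses"] = demo["Text"].str.findall(r"[a-z]+vir[a-z0-9]+").apply(set)
# demo["viruses"]: row 0 -> {"coronavirus", "paleovirus"}, row 1 -> set()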
I understand that my dataset is large and that there is a lot of data in the full-text column (400,000+ rows, with hundreds of words per article), but my current script has now run for two days non-stop. I would like to check whether there is a way to improve its efficiency, as I want to run other regular expression searches and would rather not wait this long each time.
I've created a mock dataset below, but the script is exactly the one I am using. Is there any way for me to improve its efficiency?
import pandas as pd
import re
# Mock dataset
df = pd.DataFrame(data = {"Title": ["Article 1", "Article 2", "Article 3"],
"Abstract":["Abstract 1", "Abstract 2", "Abstract 3"],
"Text":["Lorem ipsum dolor sit amet, consectetur adipiscing elit, coronavirus sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, papavirus, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia paleovirus deserunt mollit anim id est laborum.",
"Lorem ipsum dolor sit amet, paleovirus consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui coronavirus officia deserunt mollit anim id est laborum.",
"Lorem coronavirus ipsum dolor sit amet, astrovirus consectetur adipiscing elit, sed do eiusmod tempor astrovirus incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."]
})
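For the mock data above, the output I expect in the new "ID_virus" column looks roughly like this (assuming the ICTV list produces patterns that match these made-up names):

# Row 0: {"coronavirus", "papavirus", "paleovirus"}
# Row 1: {"paleovirus", "coronavirus"}
# Row 2: {"coronavirus", "astrovirus"}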
# Virus names from the ICTV master species list. I downloaded the spreadsheet
# from the ICTV website; to replicate, please download it and change the folder
# path below.
virus = pd.read_excel(r"D:\Python work\2020-03-13\ICTV Master Species List 2018b.v2.xlsx", sheet_name = "ICTV 2018b Master Species #34 v")
# Organising the data: keep only the taxonomy columns
virus = virus[['Realm', 'Subrealm', 'Kingdom', 'Subkingdom', 'Phylum',
'Subphylum', 'Class', 'Subclass', 'Order', 'Suborder', 'Family',
'Subfamily', 'Genus', 'Subgenus', 'Species']]
# Melt the taxonomy columns into a single column. Articles may mention the same
# virus in different derived forms (coronavirus, coronaviridae, coronavirales,
# etc.), and I want to capture all versions.
virus = virus.melt(id_vars=None, var_name="virus_class", value_name="virus_name")
virus.drop_duplicates(subset=["virus_name"], inplace=True)
virus.dropna(subset=["virus_name"], inplace=True)
virus["virus_name"] = virus["virus_name"].str.lower()
# From my understanding, every virus name contains the stem "vir"; it can appear
# at the beginning, in the middle, or at the end of the name.
# This line captures the prefix before "vir" using a regular expression.
virus["tgt"] = virus["virus_name"].apply(lambda x: re.findall("[a-z].*(?=vir)", x))
# Convert the list returned by findall to a string and strip the list syntax
virus["tgt"] = (virus["tgt"].astype(str)
                            .str.strip()
                            .str.replace("[", "", regex=False)
                            .str.replace("]", "", regex=False)
                            .str.replace("'", "", regex=False))
virus["tgt"] = "[a-z]+" + virus["tgt"] + "vir[a-z0-9]+"
virus["tgt"].drop_duplicates(inplace=True)
# Combine all the patterns in the Series into a single alternation string.
# Thank you for providing this code (Python: Elegant way to check if at least
# one regex in list matches a string).
regexes = virus["tgt"].tolist()
combined = "(" + ")|(".join(regexes) + ")"
# This is the code I am most worried about. It runs the combined regular
# expression against a text and collects what it finds; set() removes duplicates.
def working(text):
    # findall returns a tuple of groups per match; join each tuple into a string
    found = set("".join(groups) for groups in re.findall(combined, text))
    print(found)  # debugging output; printing every row also adds overhead
    return found
# This line runs the search on each row's text. However, as mentioned, this is
# the step that is taking days.
df["ID_virus"] = df["Text"].apply(working)
The script returns what I want, but it is slow
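For reference, the only cheap speed-up I have thought of myself is to short-circuit on the stem before running the big alternation; a minimal sketch (the results should be identical, since every generated pattern contains the literal "vir"):

combined_re = re.compile(combined)  # compile the big pattern once up front

def working_fast(text):
    # Every target pattern contains the literal "vir", so a cheap substring
    # test lets us skip the expensive alternation for texts that cannot match
    if "vir" not in text:
        return set()
    return {"".join(groups) for groups in combined_re.findall(text)}

df["ID_virus"] = df["Text"].apply(working_fast)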
I apologise for the very long entry, but I wanted to provide as much information as possible to recreate the problem. The script works (I think), but as mentioned, it has been running for two days.
Any help would be appreciated.