I have a large list of chemical names (~30,000,000) and a large collection of articles (~34,000) stored on a server as XML files.
I am trying to search each XML file, read in as a string, for mentions of one or more chemical names. The final result should be a tab-separated text file in which each line holds a file name followed by the chemicals that appear in that file.
The current issue is that I have a for loop that iterates over all the chemicals, nested inside a for loop that iterates over all the XMLs. Inside the inner loop is Python's string-in-string test (the in operator), so the whole run performs on the order of 10^12 substring checks. Is there any way to improve the performance, either by using a more efficient operation than string in string or by rearranging the for loops?
My pseudocode:
for article in articles:
    chemicals_in_article = []
    temp_article = article.lower()
    for chemical in chemicals:
        if chemical in temp_article:
            chemicals_in_article.append(chemical)
    # Write the results into a text file
    output_file.write(article.file_name)
    for chemical in chemicals_in_article:
        output_file.write("\t" + chemical)
    output_file.write("\n")
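One possible rearrangement of the loops, as a minimal sketch rather than a definitive solution: instead of testing every chemical against every article, put the chemical names in a set and look up runs of words from each article in that set. The per-article cost then depends on the article length and the longest name, not on the 30,000,000 names. The function names `build_index` and `find_chemicals` are hypothetical. The sketch assumes the article text has already been stripped of XML markup, that names are matched as whole whitespace-delimited word runs (so "ethanol" would no longer match inside "methanol", unlike the substring test above), and that the set of names fits in memory, which for this many names may need several gigabytes.

```python
def build_index(chemicals):
    """Return (set of lowercased names, max number of words in any name)."""
    names = {c.lower() for c in chemicals}
    max_words = max((len(n.split()) for n in names), default=0)
    return names, max_words


def find_chemicals(text, names, max_words):
    """Return the sorted names that occur as whole-word runs in text.

    Assumption: text is plain text (markup already removed).
    """
    words = text.lower().split()
    found = set()
    for start in range(len(words)):
        for length in range(1, max_words + 1):
            if start + length > len(words):
                break
            # Join a run of words and drop trailing sentence punctuation
            candidate = " ".join(words[start:start + length]).rstrip(".,;:")
            if candidate in names:  # average O(1) set lookup
                found.add(candidate)
    return sorted(found)


if __name__ == "__main__":
    names, max_words = build_index(["Ethanol", "acetic acid", "sodium chloride"])
    text = "We dissolved sodium chloride in ethanol, then stirred."
    print(find_chemicals(text, names, max_words))
    # prints ['ethanol', 'sodium chloride']
```

If true substring matching is required, a multi-pattern matcher such as the Aho-Corasick algorithm would be the closer equivalent; the sketch above trades that behaviour for simplicity.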