
I have a large list of chemical names (~30,000,000) and a large collection of articles (~34,000) stored as XML files on a server.

I am trying to search every XML file, as a string, for mentions of one or more chemical names. The final result would be a tab-separated text file where each line has a file name followed by the list of chemicals that appear in that file.

The current issue is that I have a for loop over all the chemicals nested inside a for loop over all the XMLs, with Python's `in` substring check inside the inner loop. Is there any way to improve performance, either by using a more efficient operation than the substring check or by rearranging the loops?

My pseudo code:

for article in articles:
    chemicals_in_article = []
    temp_article = article.lower()
    for chemical in chemicals:
        if chemical in temp_article:
            chemicals_in_article.append(chemical)

    # Write the results into a text file
    output_file.write(article.file_name)
    for chemical in chemicals_in_article:
        output_file.write("\t" + chemical)
    output_file.write("\n")

               
Justin
  • 1) Parse the XML file in xtree or libxml2; 2) Use xpath to query that result. Fast, easy, accurate. – dawg Jun 01 '21 at 16:53
  • @dawg Likely a good idea, to avoid matching tags and attributes (though due to domain there may be little to no overlap with text). I would still compile a trie and convert to regexp, then use XPath `matches` with it, or it would not be fast. – Amadan Jun 01 '21 at 16:57
  • @Amadan: Agreed. It is not clear from your answer how the XML would be parsed into the Trie however. – dawg Jun 01 '21 at 17:01
  • @dawg I clarified in my answer. – Amadan Jun 01 '21 at 17:51

2 Answers


I am not sure whether 30M entries would blow your memory, but a trie-based approach would likely be the fastest. Several packages implement this in slightly different forms, for example FlashText or trieregex. Both have examples that are an exact match for your scenario.
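To make the idea concrete, here is a minimal, self-contained sketch of what such trie-to-regex libraries do internally: merge words sharing prefixes into one pattern so the whole keyword set is matched in a single scan. This is pure stdlib for illustration only; the actual FlashText/trieregex APIs and internals differ.

```python
import re

def build_trie_regex(words):
    """Build one regex from a character trie so common prefixes are shared.
    A minimal illustration of what libraries like trieregex do internally."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # end-of-word marker

    def to_pattern(node):
        # A node holding only the end marker contributes nothing further.
        if "" in node and len(node) == 1:
            return ""
        alts, optional = [], False
        for ch, child in sorted(node.items()):
            if ch == "":
                optional = True  # a word can end here, so the rest is optional
            else:
                alts.append(re.escape(ch) + to_pattern(child))
        if len(alts) == 1 and not optional:
            return alts[0]
        group = "(?:" + "|".join(alts) + ")"
        return group + "?" if optional else group

    return re.compile(r"\b" + to_pattern(trie) + r"\b")

pattern = build_trie_regex(["lemon", "lemons", "lime", "limes"])
print(pattern.pattern)
# => \bl(?:emon(?:s)?|ime(?:s)?)\b
print(pattern.findall("three limes and a lemon"))
# => ['limes', 'lemon']
```

The key point is that the regex engine walks each article once, character by character, instead of running 30M separate `in` checks per article.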

EDIT: ...at least on plain text. Per the comments above, if you want to avoid matching random bits of markup, build a trie and then use the XPath `matches` function to find text nodes where the trie-derived regexp finds a match. Unfortunately, the main XML library for Python does not support `matches` (and indeed very few libraries support XPath 2.0), so this is not very workable.

Since all you need is to detect the presence of your keywords anywhere in the text of the document, a viable workaround is to convert the XML to text and then employ one of the methods above. Here is an example:

# pip install libxml2-python3 trieregex

from trieregex import TrieRegEx as TRE
from libxml2 import parseDoc
import re


# prepare
words = ['lemon', 'lemons', 'lime', 'limes', 'pomelo', 'pomelos', 'orange', 'oranges', 'citrus', 'citruses']
tre = TRE(*words)
pattern = re.compile(fr"\b{tre.regex()}\b")
# => \b(?:l(?:emons?|imes?)|citrus(?:es)?|oranges?|pomelos?)\b


# search
xml = """
<?xml version="1.0"?>
<recipe>
  <substitute for="lemon">three limes</substitute>
  <substitute for="orange">pomelo</substitute>
</recipe>
""".strip()
doc = parseDoc(xml)
text = doc.getContent()
matches = pattern.findall(text)
print(matches)
# => ['limes', 'pomelo']
doc.freeDoc()

Note that you only need to prepare the regex once; you can then apply it very fast on multiple documents.
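If installing the libxml2 bindings is a hurdle, the same text-extraction step can be sketched with only the standard library: `xml.etree.ElementTree.itertext()` yields the text nodes (and skips attribute values like `for="lemon"`, which is exactly what we want). The pattern below is written out inline for the sketch; in practice you would reuse the trie-built one.

```python
import re
import xml.etree.ElementTree as ET

# Inline pattern for illustration; in practice, reuse the trie-derived regex.
pattern = re.compile(r"\b(?:l(?:emons?|imes?)|pomelos?)\b")

xml = """<recipe>
  <substitute for="lemon">three limes</substitute>
  <substitute for="orange">pomelo</substitute>
</recipe>"""

# itertext() walks only text nodes, so tag names and attribute
# values (e.g. for="lemon") can never produce a false match.
root = ET.fromstring(xml)
text = " ".join(root.itertext())
print(pattern.findall(text))
# => ['limes', 'pomelo']
```

Note that `for="lemon"` does not appear in the matches, unlike a naive substring scan over the raw file.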

Amadan

Check out regex statements. They can sometimes be faster than a string-in-string check, though there is a bit of a learning curve in using them.

Check out this SO question and accepted answer for some clues on this.
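A minimal sketch of the idea this answer suggests, using a small hypothetical chemical list: compile one alternation of all names once, then scan each article in a single pass instead of looping over every name.

```python
import re

chemicals = ["benzene", "toluene", "xylene"]  # hypothetical subset of the real list

# Compile one alternation, longest-first so longer names win ties,
# with word boundaries to avoid matching inside longer words.
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(c) for c in sorted(chemicals, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)

article = "Traces of Toluene and xylene were detected."
found = sorted({m.lower() for m in pattern.findall(article)})
print(found)
# => ['toluene', 'xylene']
```

For 30M names a flat alternation like this will be slow to compile and match; the trie-based libraries in the other answer build a much more compact pattern, but the compile-once, scan-once structure is the same.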

Andrew Pye