
I'm crawling a series of webpages and organising their content into an in-memory knowledge base. I need to execute different code depending on my string input, which is crawled from a website's headings.

tags = browser.find_elements_by_xpath("//div[@class='main-content-entry']/h2")
for tag in tags:
  heading = tag.get_attribute("textContent").lower().strip()
  content = tag.parent
  if "overview" in heading:
    pass  # do this
  elif "takeaways" in heading:
    pass  # do that
  # do more elifs
  else:
    pass  # do something else

Right now, I have it implemented as an if-elif-else statement. I've seen answers around the site suggesting the use of dicts, but from what I can tell that's dependent on the input being an exact match to the key. In my case, however, exact matches are not always possible due to inconsistencies on the website owner's part.

The pages are structured enough that I know what the heading names are, so I can define the "keys" in advance in my code. However, some headings contain typos or slight variations across the hundred-odd pages. For example:

  • Fees & Funding
  • Fees
  • Fees &Funding
  • Certificates
  • Certificate
  • Certificat & Exams
  • Exams & Certificates

The best I can do at the moment is to make a first pass through the pages, collect the entire set of headings, then manually define the substrings to use in my code so that I avoid repetition.

Considering the above, is there a better way than to iteratively execute a chained if-elif-else statement?

Edit

The suggested answers in Replacements for switch statement in Python? don't work in my situation. Take for example:

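def do_this(heading):
  # Store the functions themselves and call only the one that was looked up;
  # writing do_overview() inside the dict would call every function up front.
  return {
    "overview": do_overview,
    "fees": do_fees,
    # ...
  }[heading]()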

This would have been the suggested implementation by that question's answers. But how do I return do_fees() when heading is "fees & funding", "fees", "fees &funding" etc. etc.? I need to execute the correct function if the key value is a substring of heading.

thegreatjedi

2 Answers

2

Considering the above, is there a better way than to iteratively execute a chained if-elif-else statement?

There's no requirement for you to directly look up values from the dictionary using specific keys. You can just use a dictionary to condense your parsing logic:

def extract_overview(content):
    ...

def extract_takeaways(content):
    ...

EXTRACTORS = {
    'overview': extract_overview,
    'takeaways': extract_takeaways,
    # ...
}

for tag in tags:
    heading = tag.get_attribute("textContent").lower().strip()
    content = tag.parent

    for substring, function in EXTRACTORS.items():
        if substring in heading:
            result = function(content)
            break
    else:
        # No extractor was found
        result = None
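
To try the same loop without a browser, here is a minimal sketch that uses plain strings in place of Selenium elements. The extractor bodies, the `find_extractor` helper and the "certificat" key are placeholders made up for illustration, not part of the question's code. Dictionaries preserve insertion order in Python 3.7+, so when a heading contains more than one key, the key listed first wins.

```python
def extract_overview(content):
    return ("overview", content)

def extract_fees(content):
    return ("fees", content)

def extract_certificates(content):
    return ("certificates", content)

# Hypothetical keys: "certificat" (no trailing "e") is chosen so that it
# also matches the "Certificat & Exams" typo from the question.
EXTRACTORS = {
    "overview": extract_overview,
    "fees": extract_fees,
    "certificat": extract_certificates,
}

def find_extractor(heading):
    """Return the first extractor whose key is a substring of heading, or None."""
    normalized = heading.lower().strip()
    for substring, function in EXTRACTORS.items():
        if substring in normalized:
            return function
    return None

for heading in ["Fees &Funding", "Certificat & Exams", "Contact"]:
    extractor = find_extractor(heading)
    if extractor is None:
        print(heading, "-> no extractor found")
    else:
        print(heading, "->", extractor("some content"))
```

Returning the function (or None) from a helper keeps the "not found" case explicit and avoids relying on for/else.
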
Blender
  • Thanks! This definitely looks much cleaner. I guess there's just no way to not use looping for Python alternatives to switch huh – thegreatjedi May 19 '19 at 06:04
  • 1
    @thegreatjedi: If you pretend Python has a C-like `switch` statement, how would it simplify your code? A C-like `switch` acts (more or less) like the `do_this` example you have in your question. – Blender May 19 '19 at 06:09
-1

If you want to match typo-ed strings, you will need some kind of fuzzy matching for some of your inputs. For those that are well formed, however, you can get the constant-time lookup of a switch statement by tweaking the dictionary approach: look up each word of the heading in the dictionary instead of testing every case in turn. (This only matters if you have a lot of cases.)

funcs = {
    "certificates": lambda: "certificates",
    "fees": lambda: "fees",
}

headings = ['Fees & Funding', 'Fees', 'Fees &Funding', 'Certificates',
            'Certificate', 'Certificat & Exams', 'Exams & Certificates']

def do_this(heading):
    words = heading.lower().split()
    funcs_to_call = [funcs[word] for word in words if word in funcs]
    if len(funcs_to_call) == 1:
        return funcs_to_call[0]()
    elif len(funcs_to_call) == 0:
        return 'needs fuzzy matching'
    else:
        # Alternatively, treat it as belonging to multiple categories.
        raise ValueError("This fits more than one category")


for heading in headings:
    print(heading, do_this(heading), sep=': ')
# outputs:
Fees & Funding: fees
Fees: fees
Fees &Funding: fees
Certificates: certificates
Certificate: needs fuzzy matching
Certificat & Exams: needs fuzzy matching
Exams & Certificates: certificates
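
For the headings that fall through to 'needs fuzzy matching', one option is the standard library's `difflib.get_close_matches`. This is only a sketch: the `closest_category` helper, the `KNOWN_WORDS` list and the 0.8 cutoff are assumptions for illustration, and the cutoff would need tuning against your real headings.

```python
import difflib

# Assumed list of well-formed category words (same keys as funcs above).
KNOWN_WORDS = ["certificates", "fees"]

def closest_category(heading, cutoff=0.8):
    """Return the known word closest to any word in heading, or None."""
    for word in heading.lower().split():
        # get_close_matches returns at most n candidates whose similarity
        # ratio to `word` is at least `cutoff`, best match first.
        matches = difflib.get_close_matches(word, KNOWN_WORDS, n=1, cutoff=cutoff)
        if matches:
            return matches[0]
    return None

for heading in ["Certificate", "Certificat & Exams", "Contact"]:
    print(heading, closest_category(heading), sep=": ")
```

A high cutoff keeps unrelated words from matching, at the cost of missing heavier typos.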

If you are able to predict the kinds of typos you are going to face, you could normalise the strings beforehand to get more exact matches, for example by removing symbols and making words plural.
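
A rough sketch of that kind of clean-up is below. Note that it strips a trailing "s" (making words singular) instead of pluralising, simply because that is easier to do naively; the `normalize` and `categories` helpers and the singular keys are made up for illustration.

```python
import re

# Assumed singular keys, to match the output of normalize() below.
CATEGORY_KEYS = {"fee", "certificate"}

def normalize(heading):
    """Lowercase, turn runs of non-letters into spaces, strip a trailing 's' per word."""
    words = re.sub(r"[^a-z]+", " ", heading.lower()).split()
    return [word[:-1] if word.endswith("s") else word for word in words]

def categories(heading):
    """Return the normalised words of heading that are known category keys."""
    return [word for word in normalize(heading) if word in CATEGORY_KEYS]

for heading in ["Fees &Funding", "Certificate", "Certificat & Exams"]:
    print(heading, categories(heading), sep=": ")
```

With this, 'Fees &Funding' and 'Certificate' become exact matches, while 'Certificat & Exams' still comes back empty and would need fuzzy matching.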

Will