I am trying to build a list of the anchor names in w:hyperlink elements by looping over all of a document's elements with the python-docx library, using this code:

def get_hyperlinks(docx__document):
    hyperlinks_in_document = list()
    for counter, element in enumerate(docx__document.elements):
        if isinstance(element, Paragraph) and not element.is_heading:
            hyperlinks_in_document.extend(element._element.xpath('//w:hyperlink/@w:anchor'))
    return list(set(hyperlinks_in_document))

The above code returns a list of the anchors it finds. The issue I'm having is that when a text is split into multiple runs, the list built while looping over the elements can contain duplicate names, and the output looks like this:

['American', 'Syrian', 'American', 'Syrian', 'American', 'Syrian', 'American', 'Syrian']

I tried the code from here, but either the duplicates remain or the performance suffers. However, this code:

def get_hyperlinks(docx__document):
    hyperlinks_in_document = list()
    returned_links = list()
    for counter, element in enumerate(docx__document.elements):
        if isinstance(element, Paragraph) and not element.is_heading:
            hyperlinks_in_document.extend(element._element.xpath('//w:hyperlink/@w:anchor'))
            [returned_links.append(element_in_list) for element_in_list in hyperlinks_in_document
             if element_in_list not in returned_links]
    return returned_links

solves the duplicate issue, but the performance is affected. Any ideas that can help?

Ahmad
    Why can't you use only a `set` and do set union? That will solve the duplication and the performance problem. – nonDucor Mar 04 '22 at 09:42
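
The set-union idea from the comment can be sketched roughly as follows. This is only an illustration in plain Python: the nested `sample` list stands in for the anchors found in each paragraph, and both function names are hypothetical. The second variant is an alternative for the case where first-seen order matters, since a plain `set` does not keep order.

```python
def unique_anchors(anchors_per_paragraph):
    """Collect anchors from every paragraph, keeping each name once."""
    seen = set()
    for anchors in anchors_per_paragraph:
        seen.update(anchors)  # in-place set union; duplicates are dropped
    return seen


def unique_anchors_ordered(anchors_per_paragraph):
    """Same idea, but keeps first-seen order (dicts preserve insertion order)."""
    ordered = {}
    for anchors in anchors_per_paragraph:
        ordered.update(dict.fromkeys(anchors))
    return list(ordered)


# Hypothetical per-paragraph results, standing in for the XPath output
sample = [["American", "Syrian"], ["American"], ["Syrian", "American"]]
print(sorted(unique_anchors(sample)))
print(unique_anchors_ordered(sample))
```

Membership tests on a set or dict take roughly constant time, whereas `x not in some_list` scans the list, which is probably where the slowdown in the list-based version comes from.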

1 Answer

I changed the previous code and found that collecting the results in a set instead of a list gives non-duplicate items in less time:

from docx.oxml.ns import qn
from docx.text.paragraph import Paragraph  # assuming Paragraph is python-docx's class


def get_hyperlinks(docx__document):
    returned_links = set()
    for element in docx__document.elements:
        if isinstance(element, Paragraph) and not element.is_heading:
            # './/' limits the search to this paragraph; a leading '//' scans the whole document
            for hyperlink in element._p.xpath('.//w:hyperlink'):
                anchor = hyperlink.get(qn("w:anchor"))
                if anchor is not None:  # skip hyperlinks without a w:anchor (e.g. external links)
                    returned_links.add(anchor)
    # [returned_links.append(element_in_list) for element_in_list in hyperlinks
    #          if element_in_list not in returned_links]
    return list(returned_links)

The commented-out lines show what I did before; the rest is the final code.
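
A likely reason for the repeated names in the question: an XPath expression that starts with `//` is evaluated from the document root, not from the element it is called on, so every paragraph returns every anchor in the document. The sketch below imitates that effect with the standard library's `xml.etree.ElementTree` rather than python-docx. The XML fragment and the two function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"

# Hypothetical two-paragraph body, each paragraph holding one internal link.
BODY_XML = f"""
<w:body xmlns:w="{W_NS}">
  <w:p><w:hyperlink w:anchor="American"><w:r><w:t>US</w:t></w:r></w:hyperlink></w:p>
  <w:p><w:hyperlink w:anchor="Syrian"><w:r><w:t>SY</w:t></w:r></w:hyperlink></w:p>
</w:body>
""".strip()

body = ET.fromstring(BODY_XML)
paragraphs = body.findall(f"{W}p")


def anchors_document_wide(body, paragraphs):
    """Search from the body for every paragraph, like a leading '//'."""
    found = []
    for _ in paragraphs:
        found.extend(h.get(f"{W}anchor") for h in body.iter(f"{W}hyperlink"))
    return found


def anchors_per_paragraph(paragraphs):
    """Search only inside each paragraph, like a leading './/'."""
    found = []
    for p in paragraphs:
        found.extend(h.get(f"{W}anchor") for h in p.iter(f"{W}hyperlink"))
    return found


print(anchors_document_wide(body, paragraphs))
print(anchors_per_paragraph(paragraphs))
```

With the document-wide search, each name appears once per paragraph visited. With the scoped search, each hyperlink is reported only by the paragraph that contains it, so the result has fewer duplicates and each lookup touches less of the tree.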

Ahmad