Generate a non-duplicate list in Python

Question

I am trying to build a list of the anchor names found in w:hyperlink elements by looping over all of a document's elements with the python-docx library, using this code:

def get_hyperlinks(docx__document):
    hyperlinks_in_document = list()
    for counter, element in enumerate(docx__document.elements):
        if isinstance(element, Paragraph) and not element.is_heading:
            hyperlinks_in_document.extend(element._element.xpath('//w:hyperlink/@w:anchor'))
    return list(set(hyperlinks_in_document))

The code above returns a list of the anchors it finds. The issue is that when a piece of text is split across multiple runs, the list built while looping over the elements can contain duplicate names, and the output looks like this:

['American', 'Syrian', 'American', 'Syrian', 'American', 'Syrian', 'American', 'Syrian']

I tried several snippets from here, but they either still produced duplicates or hurt performance. This code:

def get_hyperlinks(docx__document):
    hyperlinks_in_document = list()
    returned_links = list()
    for counter, element in enumerate(docx__document.elements):
        if isinstance(element, Paragraph) and not element.is_heading:
            hyperlinks_in_document.extend(element._element.xpath('//w:hyperlink/@w:anchor'))
            [returned_links.append(element_in_list) for element_in_list in hyperlinks_in_document
             if element_in_list not in returned_links]
    return returned_links

solves the duplicate issue, but performance suffers. Any ideas that could help?

Why can't you use only a `set` and do set union? That will solve the duplication and the performance problem. — nonDucor, Mar 04 '22 at 09:42
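
A minimal sketch of what this comment suggests, using plain lists in place of the per-paragraph XPath results. The names `unique_anchors` and `anchors_per_paragraph` are made up for illustration and are not part of python-docx:

```python
def unique_anchors(anchors_per_paragraph):
    """Merge several lists of anchor names into one duplicate-free list."""
    seen = set()
    for anchors in anchors_per_paragraph:
        # Set union: adding a name that is already present is a no-op,
        # and each membership check is O(1) on average.
        seen |= set(anchors)
    # Sorted only to make the result deterministic; a set has no order.
    return sorted(seen)


sample = [['American', 'Syrian'], ['American', 'Syrian'], ['American']]
print(unique_anchors(sample))  # ['American', 'Syrian']
```

The list-based `if x not in returned_links` check rescans the whole list for every item, which is likely where the slowdown in the question comes from.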

score 0 · Accepted Answer · answered Mar 06 '22 at 11:02

I changed the previous code and switched the final list to a set, so I now get non-duplicate items in less time:

from docx.oxml.ns import qn


def get_hyperlinks(docx__document):
    returned_links = set()
    for element in docx__document.elements:
        if isinstance(element, Paragraph) and not element.is_heading:
            # './/' keeps the search inside this paragraph; a leading '//'
            # would scan the whole document on every iteration.
            for hyperlink in element._p.xpath('.//w:hyperlink'):
                anchor = hyperlink.get(qn("w:anchor"))
                if anchor is not None:  # w:anchor may be absent, e.g. on external links
                    returned_links.add(anchor)
    # [returned_links.append(element_in_list) for element_in_list in hyperlinks
    #          if element_in_list not in returned_links]
    return list(returned_links)

The commented lines show what I did before; the rest is the final code.
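
One caveat: a set does not keep the order in which the anchors appear in the document. If that order matters, a sketch like the following may help. It relies on dict keys being unique and insertion-ordered (Python 3.7+); `dedupe_keep_order` is a hypothetical helper name, not something from python-docx:

```python
def dedupe_keep_order(items):
    """Return items without duplicates, keeping first-seen order."""
    # dict.fromkeys builds a dict whose keys are the items; repeated
    # keys are ignored, and insertion order is preserved.
    return list(dict.fromkeys(items))


anchors = ['American', 'Syrian', 'American', 'Syrian']
print(dedupe_keep_order(anchors))  # ['American', 'Syrian']
```

This keeps the average O(1) lookup of a set while avoiding the repeated list scans of the `not in returned_links` approach.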
