Obtain text if its href attribute is in list but prevent obtaining text if its href attribute is a duplicate

Question

I currently have two lists. The first contains anchor elements; some of them share the same href but have different text:

list1 = [<a href="link1">'text1'</a>, <a href="link1">'text2'</a>, 
         <a href="link2"><a href="link2"><span class="flagicon">
         <img Img stuff/></span>'text3'</a>, <a href="link2">'text4'</a>]

From this list I have managed to obtain the href links, and then I removed all duplicates, so that each href appears only once. Now my list with unique href links is:

list2 = ['link1','link2']

Now comes the tricky part. I want to use each unique href from my second list to find the corresponding text in my first list, but only once. I used this example to extract only the unique href elements while preserving order. I also want to use that approach to obtain the text belonging to each unique href from list1.

seen_text = set()
seen_text_add = seen_text.add
unique_text = [x.text for x in list1 if list2 in x and not (x in seen or seen_add(x))]

But this just returns an empty list. Can this be done?

EDIT: My expected result is unique_text =['text1','text3']

Can you give us a minimal example with shorter strings, like suppose `list1 = ['1 ab', '2 cd']` and `list2 = ['3 ab', '4 ab']`. Do you want to end up with `['3 ab']`? — Bahrom, Mar 27 '16 at 21:14
Thanks, and could you please also add what your expected result is? — Bahrom, Mar 27 '16 at 21:16
Looks good, but you forgot the quotes around strings in list1, I can work with this too though. — Bahrom, Mar 27 '16 at 21:19
I'll add those, I also extended on the question a bit more. This is what I would want to obtain for my actual project. — , Mar 27 '16 at 21:21
Don't you mean `if x.link in list2` and not the other way around? — Akshat Mahajan, Mar 27 '16 at 21:43
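
For reference, the comprehension in the question tests `list2 in x` (the whole list against a single element) and refers to `seen` and `seen_add`, while the names defined above it are `seen_text` and `seen_text_add`. A minimal sketch of the intended "first text per href" logic follows. It assumes the anchors have already been reduced to `(href, text)` pairs; the names `anchors`, `wanted` and `first_text_per_href` are hypothetical and not part of the original code.

```python
# Hypothetical stand-in data: (href, text) pairs in document order.
anchors = [("link1", "text1"), ("link1", "text2"),
           ("link2", "text3"), ("link2", "text4")]
wanted = ["link1", "link2"]


def first_text_per_href(pairs, hrefs):
    """Return the text of the first pair seen for each wanted href."""
    wanted_set = set(hrefs)
    seen = set()
    result = []
    for href, text in pairs:
        # Test membership of the element's href in the list,
        # not of the list in the element.
        if href in wanted_set and href not in seen:
            seen.add(href)
            result.append(text)
    return result


unique_text = first_text_per_href(anchors, wanted)
print(unique_text)  # ['text1', 'text3']
```

The set is keyed on the href rather than on the element itself, so the second anchor for the same link is skipped even though its text differs.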

Bahrom · Accepted Answer · 2016-03-28T00:04:05.733


Here's how you could do it with a generator (edited for latest example):

import re

list1 = ["<a href='link1'>'text1'</a>",
         "<a href='link1'>'text2'</a>",
         "<a href='link2'><a href='link2'><span class='flagicon'><img Img stuff/></span>'text3'</a>",
         "<a href='link2'>'text4'</a>"]
list2 = ['link1', 'link2', 'link3']


def gen(txt):
    for elem in list1:
        if txt in elem:
            # Grab only the text between a pair of tags (meaning end of tag >text< start of next tag)
            yield re.match('.*>(?P<text>.+)<.*', elem).group('text')

# For each text in list2 create a generator that will yield matching text from list1.
# Call next on that generator to grab the first result only, with default value of "not found"
x = [next(gen(text), "not found") for text in list2]

print(x)
# Output: ["'text1'", "'text3'", 'not found']
# Further process the list (get rid of the quotes etc.)

If this still doesn't work, could you please print out the contents of list1 and list2 and paste them here?
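
A side note on the `next(..., "not found")` call above: the default argument is what keeps the list comprehension from failing when nothing matches. The sketch below illustrates this with hypothetical names (`matches`, `items`) that are not part of the answer's code.

```python
def matches(needle, haystack):
    # Lazily yield every item that contains the needle.
    return (item for item in haystack if needle in item)


items = ["link1-a", "link1-b", "link2-a"]

# With a default, next() returns the first match or the fallback value.
first = next(matches("link1", items), "not found")
missing = next(matches("link3", items), "not found")

# Without a default, an exhausted generator raises StopIteration.
try:
    next(matches("link3", items))
    raised = False
except StopIteration:
    raised = True

print(first, missing, raised)
```

So if `next` is called without a default, an empty generator surfaces as a `StopIteration` error instead of a placeholder value.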

edited Mar 28 '16 at 00:04

answered Mar 27 '16 at 21:40

Bahrom


This is a very good answer, however it does not work if the link also contains an image like on [this page](http://racing4everyone.eu/2016-australian-carrera-cup/). I know this is very specific, but the code will have to work on both normal links and links with an added picture like this. EDIT: Apparently it doesn't work in general, here is my [code](http://pastebin.com/Xx5H60tf) – Mar 27 '16 at 22:25
Give me a bit till I get home, and I'll edit it. Could you please post another minimal-ish example for me to test it with? – Bahrom Mar 27 '16 at 22:29
General `list1` example: `'text1', 'text2'` I basically need code that can get the text from both the type surrounding 'text1' and the type surrounding 'text2'. I know this is a big ask :( – Mar 27 '16 at 22:33
What's your end goal? Maybe using BeautifulSoup is easier. – Bahrom Mar 27 '16 at 22:37
The end goal is a universal script that obtains all URLs and texts. I would love to use BS, but the tables vary from page to page, so I cannot write one script that will find all the data in all the different tables (some use multicolumns, some don't). – Mar 27 '16 at 22:41
Oh so instead of seeing `text1` you want to see ``? Can you add the example from comment and expected output to the question? – Bahrom Mar 27 '16 at 23:04
Basically add your more complicated example to list1 so I have more test cases – Bahrom Mar 27 '16 at 23:06
No, I want it to skip that and show me the text1 only. However, it only shows me an error: `final_race_text_noduplicates = [next(re.match('<.*>(?P<text>.+)<.*>', elem).group('text') for elem in race_urls if text in elem) for text in final_race_url_noduplicates] StopIteration` I also get it if I just use `print`. – Mar 27 '16 at 23:07
Edited my main post to reflect the problem. – Mar 27 '16 at 23:12
Ah, that's because you can have more than one set of <>s before the text. I think I just need to update the regex. Try changing it from `"<.*>(?P<text>.+)<.*>"` to `"(<.*>)+(?P<text>.+)<.*>"`. I'm on my phone right now, so I can't test anything for another 30 mins or so; let me know if it works for you. – Bahrom Mar 27 '16 at 23:13
Nah, I still get the `StopIteration` error. It's already 1:30am here. I hope you can find a solution to this, see you in the morning. PS, full code is listed above (ignore the selenium imports for now, they are needed for the final script, rendering some JS modules). – Mar 27 '16 at 23:21
This does create a list containing a text item per list item of `list2`, but they all say `"not found"`, or whatever other string is written down in its place. I'll post the content of the two lists. – Mar 28 '16 at 07:32
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/107551/discussion-between-bah-and-luc-evertzen). – Bahrom Mar 28 '16 at 13:36
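
The comments above raise two points: the regex struggles when extra tags (such as the flag-icon span) sit inside the link, and an HTML parser may be easier. As a rough sketch only, the version below swaps the regex for the standard library's `html.parser` (not BeautifulSoup, which the comment mentions) and records the text of the first anchor that has any text for each href. The class name `FirstAnchorText` and its attributes are hypothetical, and the sample data is copied from the answer.

```python
from html.parser import HTMLParser


class FirstAnchorText(HTMLParser):
    """Record the text of the first <a> with text for each distinct href."""

    def __init__(self):
        super().__init__()
        self.texts = {}       # href -> collected text, in first-seen order
        self._done = set()    # hrefs whose first text-bearing anchor has closed
        self._current = None  # href of the most recently opened <a>, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = dict(attrs).get("href")

    def handle_data(self, data):
        href = self._current
        text = data.strip()
        if href is not None and href not in self._done and text:
            self.texts[href] = self.texts.get(href, "") + text

    def handle_endtag(self, tag):
        if tag == "a":
            # Only mark the href as finished once some text was collected.
            if self._current in self.texts:
                self._done.add(self._current)
            self._current = None


list1 = ["<a href='link1'>'text1'</a>",
         "<a href='link1'>'text2'</a>",
         "<a href='link2'><a href='link2'><span class='flagicon'><img Img stuff/></span>'text3'</a>",
         "<a href='link2'>'text4'</a>"]
list2 = ['link1', 'link2', 'link3']

parser = FirstAnchorText()
parser.feed("".join(list1))
parser.close()

x = [parser.texts.get(href, "not found") for href in list2]
print(x)
```

Because the parser reacts to tags and text separately, nested elements inside the link do not need to be described in a pattern. Whether this holds up on the real pages is untested here.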
