How can I write the following code in a more efficient and pythonic way?

Question

I have a list of URLs, file_url_list, which prints like this:

www.latimes.com, www.facebook.com, affinitweet.com, ...

And another list of the top 1M URLs, top_url_list, which prints like this:

[1, google.com], [2, www.google.com], [3, microsoft.com], ...

I want to find how many URLs in file_url_list are in top_url_list. I have written the following code which works, but I know that it's not the fastest way to do it, nor the most pythonic one.

# Find the common occurrences
found = []
for file_item in file_url_list:
    for top_item in top_url_list:
        if file_item == top_item[1]:
            # When you find an occurrence, put it in a list
            found.append(top_item)

How can I write this in a more efficient and pythonic way?

Why are you storing a counter as the first element of each list? This actually makes things more complex. Is there any reason to do this? – Ahsanul Haque, Apr 27 '17 at 08:34
If the goal is to "find how many URLs [...] are in top_url_list", why are you not counting anything? Is there any particular reason why you're appending them to a list? – Aran-Fey, Apr 27 '17 at 08:35
Possible duplicate of [Find intersection of two lists?](http://stackoverflow.com/questions/642763/find-intersection-of-two-lists) – fafl, Apr 27 '17 at 08:35

score 7 · Accepted Answer · edited May 23 '17 at 12:02

Set intersection should help. Additionally, you can use a generator expression to extract just the URL from each entry in top_url_list.

file_url_list = ['www.latimes.com', 'www.facebook.com', 'affinitweet.com']
top_url_list = [[1, 'google.com'], [2, 'www.google.com'], [3, 'microsoft.com']]

common_urls = set(file_url_list) & set(url for (index, url) in top_url_list)

Or equivalently, thanks to Jean-François Fabre, with a set comprehension:

common_urls = set(file_url_list) & {url for (index, url) in top_url_list}
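
Since the question asks how many URLs are shared rather than which ones, the size of the intersection gives the count. A minimal sketch, assuming sample data in which 'google.com' has been added to file_url_list so that the overlap is non-empty (the name common_count is illustrative, not from the answer):

```python
# Sample data; 'google.com' is a hypothetical addition so the lists overlap.
file_url_list = ['www.latimes.com', 'www.facebook.com', 'affinitweet.com', 'google.com']
top_url_list = [[1, 'google.com'], [2, 'www.google.com'], [3, 'microsoft.com']]

# Intersect the file URLs with the URL column of the top list.
common_urls = set(file_url_list) & {url for (index, url) in top_url_list}

# The number of shared URLs is simply the size of the intersection.
common_count = len(common_urls)
print(common_count)  # 1
```

Note that converting to a set collapses duplicates, so a URL that appears several times in file_url_list is counted once.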

answered Apr 27 '17 at 08:36

Kos

use a set comprehension: `set(url for (index, url) in top_url_list)` => `{url for (index, url) in top_url_list}` – Jean-François Fabre Apr 27 '17 at 08:38
Extremely fast and elegant. Thank you. – Aventinus Apr 27 '17 at 08:47
Do you really need to build 2 sets if you're throwing them away? Maybe `{url for (index, url) in top_url_list}.intersection(file_url_list)`. See [**`intersection`**](https://docs.python.org/2/library/stdtypes.html#set.intersection), it takes an iterable. – Peter Wood Apr 27 '17 at 08:53
@PeterWood oh that's a good suggestion. `&` needs two sets, but `intersection` doesn't. – Kos Apr 27 '17 at 09:28
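
The difference discussed in these comments can be checked with a small sketch. The sample data and the flag name operator_accepts_list are assumptions made for illustration:

```python
top_url_list = [[1, 'google.com'], [2, 'www.google.com'], [3, 'microsoft.com']]
file_url_list = ['www.google.com', 'affinitweet.com']  # hypothetical sample data

top_urls = {url for (index, url) in top_url_list}

# intersection() accepts any iterable, so the list does not have to be
# converted to a set by hand first.
common_urls = top_urls.intersection(file_url_list)
print(common_urls)  # {'www.google.com'}

# The & operator requires a set on both sides and raises TypeError otherwise.
try:
    top_urls & file_url_list
    operator_accepts_list = True
except TypeError:
    operator_accepts_list = False
print(operator_accepts_list)  # False
```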

Peter Wood · Answer 2 · 2017-04-27T12:26:21.293

You say you want to know how many URLs from the file are in the top 1M list, not which ones they are. Build a set from the larger list (I assume that will be the top 1M), then iterate through the other list, counting the entries that are in the set:

top_urls = {url for (index, url) in top_url_list}
total = sum(url in top_urls for url in file_url_list)

If the file list is larger, build the set from that instead:

file_urls = set(file_url_list)
total = sum(url in file_urls for index, url in top_url_list)

`sum` adds numbers together. `url in top_urls` evaluates to a `bool`, either `True` or `False`. Because `bool` is a subclass of `int`, `True` counts as 1 and `False` as 0. So `url in top_urls for url in file_url_list` effectively generates a sequence of 1s and 0s for `sum`.
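
To make the bool-to-integer behaviour concrete, here is a small sketch that materialises the flags before summing them. The sample data and the name flags are assumptions for illustration:

```python
file_url_list = ['www.latimes.com', 'google.com', 'microsoft.com']  # hypothetical
top_urls = {'google.com', 'www.google.com', 'microsoft.com'}

# One bool per file URL: is it in the top set?
flags = [url in top_urls for url in file_url_list]
print(flags)       # [False, True, True]

# Summing the bools counts the True values.
print(sum(flags))  # 2

# bool is a subclass of int, so arithmetic on bools works directly.
print(isinstance(True, int), True + True)  # True 2
```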

Perhaps slightly more efficient (I'd have to test it), you could filter and only sum 1s if url in top_urls:

total = sum(1 for url in file_url_list if url in top_urls)
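
One way to test which variant is faster is the standard library's timeit module. The sketch below uses synthetic data (the site%d.example names and the helper names count_bools and count_ones are made up for illustration); timings depend on the machine and Python version, so no result is claimed here:

```python
import timeit

# Synthetic data: 10,000 "top" URLs and 10,000 file URLs, half of which overlap.
top_urls = {'site%d.example' % i for i in range(10000)}
file_url_list = ['site%d.example' % i for i in range(0, 20000, 2)]

def count_bools():
    # Sum one bool per URL.
    return sum(url in top_urls for url in file_url_list)

def count_ones():
    # Yield a 1 only for URLs that are present.
    return sum(1 for url in file_url_list if url in top_urls)

# Both variants must agree on the count before comparing their speed.
assert count_bools() == count_ones()

for fn in (count_bools, count_ones):
    seconds = timeit.timeit(fn, number=20)
    print(fn.__name__, round(seconds, 4))
```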

score 1 · Answer 3 · edited May 23 '17 at 12:18

You could extract the URLs from the second list and then either use a set, as Kos has shown in his answer, or use filter with a lambda.

top_url_list_flat = [item[1] for item in top_url_list]
print filter(lambda url: url in file_url_list, top_url_list_flat)

The print statement above is Python 2 syntax. In Python 3, filter returns a lazy iterator rather than a list, so loop over it (or wrap it in list()) to see the results:

for common in filter(lambda url: url in file_url_list, top_url_list_flat):
    print(common)

answered Apr 27 '17 at 08:34

Chankey Pathak

This works when you remove the counter from `top_url_list` – fafl Apr 27 '17 at 08:37
But, it's not mentioned anywhere in the answer. – Ahsanul Haque Apr 27 '17 at 08:38
My bad, I didn't notice the second list. – Chankey Pathak Apr 27 '17 at 08:39
`url in file_url_list` performs a linear search – Peter Wood Apr 27 '17 at 09:20
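
The linear search noted in the last comment can be avoided by testing membership against a set instead of the list. A minimal Python 3 sketch, assuming sample data with one overlapping URL (the names file_urls and common are illustrative):

```python
file_url_list = ['www.latimes.com', 'www.facebook.com', 'google.com']  # hypothetical
top_url_list = [[1, 'google.com'], [2, 'www.google.com'], [3, 'microsoft.com']]

# Build the set once; each membership test is then a hash lookup on average
# instead of a scan through the whole list.
file_urls = set(file_url_list)
top_url_list_flat = [item[1] for item in top_url_list]

# list() forces the lazy filter object so the result can be printed.
common = list(filter(lambda url: url in file_urls, top_url_list_flat))
print(common)  # ['google.com']
```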
