
I have a list with urls: file_url_list, which prints to this:

www.latimes.com, www.facebook.com, affinitweet.com, ...

And another list of the Top 1M urls: top_url_list, which prints to this:

[1, google.com], [2, www.google.com], [3, microsoft.com], ...

I want to find how many URLs in file_url_list are in top_url_list. I have written the following code which works, but I know that it's not the fastest way to do it, nor the most pythonic one.

# Find the common occurrences
found = []
for file_item in file_url_list:
    for top_item in top_url_list:
        if file_item == top_item[1]:
            # When you find an occurrence, put it in a list
            found.append(top_item)

How can I write this in a more efficient and pythonic way?

Aventinus
  • Why are you storing a counter as the first element of each list? This actually makes things more complex. Is there any reason to do this? – Ahsanul Haque Apr 27 '17 at 08:34
  • If the goal is to "find how many URLS [...] are in top_url_list", why are you not counting anything? Is there any particular reason why you're appending them to a list? – Aran-Fey Apr 27 '17 at 08:35
  • Possible duplicate of [Find intersection of two lists?](http://stackoverflow.com/questions/642763/find-intersection-of-two-lists) – fafl Apr 27 '17 at 08:35

3 Answers

Set intersection should help. Additionally, you can use a generator expression to extract just the url from each entry in top_url_list.

file_url_list = ['www.latimes.com', 'www.facebook.com', 'affinitweet.com']
top_url_list = [[1, 'google.com'], [2, 'www.google.com'], [3, 'microsoft.com']]

common_urls = set(file_url_list) & set(url for (index, url) in top_url_list)

or equivalently thanks to Jean-François Fabre:

common_urls = set(file_url_list) & {url for (index, url) in top_url_list}
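
To get the count the question asks for, take the length of the resulting set. The following is a minimal sketch using the sample data above plus one invented overlapping entry (`www.facebook.com` at rank 4 is an assumption, added only so the example has a match). It also shows one possible way to recover the full `[rank, url]` entries, as the loop in the question did:

```python
file_url_list = ['www.latimes.com', 'www.facebook.com', 'affinitweet.com']
# The rank-4 entry is invented so that the example has one match.
top_url_list = [[1, 'google.com'], [2, 'www.google.com'],
                [3, 'microsoft.com'], [4, 'www.facebook.com']]

common_urls = set(file_url_list) & {url for (index, url) in top_url_list}
count = len(common_urls)

# Recover the [rank, url] entries, like the `found` list in the question.
found = [entry for entry in top_url_list if entry[1] in common_urls]

print(count)
print(found)
```

Note that a set drops duplicates, so a URL that appears twice in file_url_list is counted once here, unlike in the nested loop.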
Kos
  • Use a set comprehension: `set(url for (index, url) in top_url_list)` => `{url for (index, url) in top_url_list}` – Jean-François Fabre Apr 27 '17 at 08:38
  • Extremely fast and elegant. Thank you. – Aventinus Apr 27 '17 at 08:47
  • Do you really need to build 2 sets if you're throwing them away? Maybe `{url for (index, url) in top_url_list}.intersection(file_url_list)`. See [**`intersection`**](https://docs.python.org/2/library/stdtypes.html#set.intersection), it takes an iterable. – Peter Wood Apr 27 '17 at 08:53
  • @PeterWood oh that's a good suggestion. `&` needs two sets, but `intersection` doesn't. – Kos Apr 27 '17 at 09:28
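
A minimal sketch of the variant suggested in these comments: `set.intersection` accepts any iterable, so only one set has to be built explicitly, whereas the `&` operator raises a `TypeError` when the other operand is a list. The data is the answer's sample data plus one invented overlapping entry (the rank-4 row is an assumption for illustration).

```python
file_url_list = ['www.latimes.com', 'www.facebook.com', 'affinitweet.com']
# The rank-4 entry is invented so that the example has one match.
top_url_list = [[1, 'google.com'], [2, 'www.google.com'],
                [3, 'microsoft.com'], [4, 'www.facebook.com']]

top_urls = {url for (index, url) in top_url_list}

# intersection() takes any iterable, so the list can be passed directly.
common_urls = top_urls.intersection(file_url_list)

# The & operator, by contrast, requires both operands to be sets.
try:
    top_urls & file_url_list
    operator_accepts_list = True
except TypeError:
    operator_accepts_list = False

print(common_urls, operator_accepts_list)
```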

You say you want to know how many URLs from the file are in the top 1M list, not which ones they are. Build a set from the larger list (I assume that will be the top 1M list), then iterate through the other list and count the items that are in the set:

top_urls = {url for (index, url) in top_url_list}
total = sum(url in top_urls for url in file_url_list)

If the file list is larger build the set from that instead:

file_urls = set(file_url_list)
total = sum(url in file_urls for index, url in top_url_list)

`sum` adds numbers together. `url in top_urls` evaluates to a bool, either `True` or `False`, and since `bool` is a subclass of `int` these count as 1 and 0 respectively. The generator expression `url in top_urls for url in file_url_list` therefore yields a sequence of 1s and 0s for `sum` to add up.

Perhaps slightly more efficient (I'd have to test it), you could filter and only sum 1s if url in top_urls:

total = sum(1 for url in file_url_list if url in top_urls)
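
Whether the filtered form is actually faster can be checked with `timeit`. The following is a rough sketch on synthetic data; the list sizes, the `site{}.example` names, and the repeat count are arbitrary assumptions, and the timings will vary by machine and Python version.

```python
import timeit

# Synthetic stand-ins for the real lists (sizes and names are assumptions).
top_url_list = [[i, 'site{}.example'.format(i)] for i in range(1, 100001)]
file_url_list = ['site{}.example'.format(i) for i in range(50000, 150000, 50)]

top_urls = {url for (index, url) in top_url_list}

def count_bools():
    # Sum the True/False results of each membership test.
    return sum(url in top_urls for url in file_url_list)

def count_ones():
    # Emit a 1 only for URLs that are present, then sum those.
    return sum(1 for url in file_url_list if url in top_urls)

# Both approaches must agree before their timings are worth comparing.
assert count_bools() == count_ones()

print('bools:', timeit.timeit(count_bools, number=200))
print('ones: ', timeit.timeit(count_ones, number=200))
```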
Peter Wood

You could take URLs from second list and then either use set as Kos has shown in his answer, or you can use lambda with filter.

top_url_list_flat = [item[1] for item in top_url_list]
print filter(lambda url: url in file_url_list, top_url_list_flat)

In Python 3, filter returns an iterator rather than a list, so you have to iterate over it (or wrap it in list()) to print the results:

for common in filter(lambda url: url in file_url_list, top_url_list_flat):
    print(common)
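
One caveat with this approach is that `url in file_url_list` scans the whole list on every call, so the work grows with the product of the two list lengths. A minimal sketch of the same `filter` idea with the file URLs converted to a set first (the name `file_urls` and the rank-4 entry are assumptions for illustration); wrapping the result in `list()` makes it behave the same in Python 2 and 3:

```python
file_url_list = ['www.latimes.com', 'www.facebook.com', 'affinitweet.com']
# The rank-4 entry is invented so that the example has one match.
top_url_list = [[1, 'google.com'], [2, 'www.google.com'],
                [3, 'microsoft.com'], [4, 'www.facebook.com']]

# A set gives average constant-time membership tests.
file_urls = set(file_url_list)
top_url_list_flat = [item[1] for item in top_url_list]

# list() materialises the filter result so it can be printed or counted.
common = list(filter(lambda url: url in file_urls, top_url_list_flat))
print(common)
print(len(common))
```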

Chankey Pathak