
I have a loop that constantly adds a variable with an unknown value to a list and then prints the list. However, I can't find a way to ignore the values previously found and added to the list when I print the list the next time.

I'm scraping a constantly updating website for keyword-matching links using requests and bs4 inside a loop. Once the website adds the links I'm looking for, my code adds them to a list and prints the list. When the website adds the next wave of matching links, those are added to my list as well, but my code also re-adds the old links found before, since they still match my keyword. Is it possible to ignore these old links?

import time

import requests
from bs4 import BeautifulSoup

url = "www.website.com"
keyword = "news"
results = []                                       #list which saves the links

while True:
    source = requests.get(url).text
    soup = BeautifulSoup(source, 'lxml')
    options = soup.find_all("a", class_="name-link")
    for o in options:
        if keyword in o.text:
            link = o.attrs["href"]                 #the links I want
            results.append(link)                   #adds links to list

    print(results)
    time.sleep(5)                                  #wait until next scrape


So with every loop the value of 'link' changes, which makes it hard for me to find a way to ignore previously found links.

To maybe make it easier to understand: you could think of a loop adding an unknown number to a list with every execution, but each number should only be printed in the first execution in which it appears.
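That analogy can be sketched with a set that remembers everything seen so far; this is a minimal sketch where the hard-coded waves list stands in for whatever each pass of the real loop produces:

```python
# Each pass yields some values, but only the ones never seen in an
# earlier pass get printed.
seen = set()
waves = [[1, 2], [2, 3], [1, 3, 4]]   # stand-in for each loop iteration's finds

for wave in waves:
    fresh = [v for v in wave if v not in seen]  # values new to this pass
    seen.update(wave)                           # remember them for next time
    print(fresh)
# prints [1, 2], then [3], then [4]
```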

Viet NaM

2 Answers


Here is a proof of concept using sets, if the challenge is that you only want to keep unique links and print only the newly found links that have not been seen previously:

import random

results = set()
for k in range(15):
    new = {random.randint(1, 5)}              # simulate one freshly scraped link
    print(f"First Seen: {new - results}")     # only what we haven't seen before
    results = results.union(new)              # remember it for later iterations
    print(f"All: {results}")
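The same set-difference idea carries over to the scraping loop itself; here is a sketch where the hard-coded batch lists are hypothetical stand-ins for the hrefs each pass of find_all would yield:

```python
def report_new(found, results):
    """Return only the links not already in results (set difference)."""
    return set(found) - results

results = set()
batch1 = ["/news/1", "/news/2"]
batch2 = ["/news/2", "/news/3"]   # /news/2 repeats from the first pass

fresh1 = report_new(batch1, results)
results |= fresh1                  # fresh1 contains both new links

fresh2 = report_new(batch2, results)
results |= fresh2                  # fresh2 contains only /news/3
```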

If it is more of a streaming issue, where you save all links to one large list but only want to print the latest batch found, you can do something like this:

import random

results = []
for k in range(5):
    n = len(results)                 # remember where the old links end
    new = []
    for j in range(random.randint(1, 5)):
        new.append(random.randint(1, 5))

    results.extend(new)
    print(results[n:])               # print only the links added this pass

But then again, you can also just print new directly in this case.

Jurgen Strydom

This is a good use case for the set data structure. Note that sets do not maintain any ordering of their items. It is a very simple change to your code above:

import time

import requests
from bs4 import BeautifulSoup

url = "www.website.com"
keyword = "news"
results = set()                                    # note: {} would create a dict, not a set

while True:
    source = requests.get(url).text
    soup = BeautifulSoup(source, 'lxml')
    options = soup.find_all("a", class_="name-link")
    for o in options:
        if keyword in o.text:
            link = o.attrs["href"]                 #the links I want
            results.add(link)                      #sets silently ignore duplicates

    print(results)
    time.sleep(5)                                  #wait until next scrape

If you want to maintain order, you can use some variation of an ordered dictionary. Please see here: Does Python have an ordered set?
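One such variation, assuming Python 3.7+ where plain dicts preserve insertion order, is to use dict keys as an ordered set:

```python
# Dict keys as an ordered set: membership tests are O(1) and iteration
# follows the order in which links were first added (Python 3.7+).
results = {}                      # keys are the links; values are unused

for link in ["/news/b", "/news/a", "/news/b", "/news/c"]:
    if link not in results:
        print("new:", link)       # only printed the first time a link appears
    results[link] = None

print(list(results))              # ['/news/b', '/news/a', '/news/c']
```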

Ankur