I'm trying to collect the URLs from a particular website (swiggy.com here). First I take every href from the page and append it to a list (linkaddresses). Then I try to remove the list elements that start with '#' and replace the ones that start with '/' with the full URL. When I run the program below, only some of those elements are changed. The program prints the whole list (linkaddresses) before and after the modification. Can anyone help?

Below is the code in Python:

import urllib
from urllib import request
from bs4 import BeautifulSoup

def linkgetter(searchlink):
    pagesource = urllib.request.urlopen(searchlink)
    linkaddresses = []
    soup = BeautifulSoup(pagesource,'lxml')
    for link in soup.findAll('a'):
        if link.get('href') == None:
            continue
        else:
            linkaddresses.append(link.get('href'))
    print(linkaddresses)
    for i in linkaddresses:
        if i.startswith('#'):
            linkaddresses.remove(i)
        elif i.startswith('/'):
            linkaddresses.append(searchlink+i)
            linkaddresses.remove(i)
    print('\n')
    print('\n')
    print('\n')

    print(linkaddresses)
linkgetter('https://www.swiggy.com')
Vadim Kotov
Bhargav K
  • 5
    As general advice, you should avoid modifying lists that you are looping through – Jules Mar 10 '20 at 10:27
  • 1
    Does this answer your question? [How to remove items from a list while iterating?](https://stackoverflow.com/questions/1207406/how-to-remove-items-from-a-list-while-iterating) – dspencer Mar 10 '20 at 10:28

1 Answer

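As mentioned in the comments, modifying a list while you are looping through it is a bad idea! Each removal shifts the remaining elements one position to the left, so the loop skips the element that follows the one you removed. You can either populate a new list with the values you want to keep, or list comprehensions can be your friend here :)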

https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions
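
To see why the original loop only removes some of the elements, here is a minimal sketch that needs no network access. The function names and the sample hrefs are made up for illustration; only the loop pattern comes from the question.

```python
def remove_while_iterating(items):
    """Buggy pattern from the question: mutate the list being looped over."""
    items = list(items)  # work on a copy so the caller's list is untouched
    for item in items:
        if item.startswith('#'):
            # Removing shifts later elements left, so the next one is skipped
            items.remove(item)
    return items


def filter_with_comprehension(items):
    """Build a new list instead of mutating the one being iterated."""
    return [item for item in items if not item.startswith('#')]


# Hypothetical sample data standing in for the scraped hrefs
sample = ['#top', '#menu', '/offers', '#footer', '#help']

buggy = remove_while_iterating(sample)
clean = filter_with_comprehension(sample)

print(buggy)  # ['#menu', '/offers', '#help'] (two '#' entries survive)
print(clean)  # ['/offers']
```

The buggy version never looks at '#menu' or '#help', because each removal moves them into a position the loop has already passed.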

I've broken your for loop into two comprehensions. The first filters out anything that starts with a #:

linkaddresses = [x for x in linkaddresses if not x.startswith('#')]

The second prepends searchlink to anything that starts with a /:

linkaddresses = [searchlink+x if x.startswith('/') else x for x in linkaddresses]
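
As a quick check of those two steps without fetching a page, the sketch below runs them on a hand-written list. The function names and sample hrefs are assumptions for illustration. The second function is an optional variant, not part of the approach above: it uses the standard library's urllib.parse.urljoin, which also copes with a trailing slash on the base URL, but note that it resolves every relative href, not only the ones starting with '/'.

```python
from urllib.parse import urljoin


def normalize_links(base_url, hrefs):
    """Apply the two comprehensions: drop '#' links, prefix '/' links."""
    hrefs = [h for h in hrefs if not h.startswith('#')]
    return [base_url + h if h.startswith('/') else h for h in hrefs]


def normalize_links_urljoin(base_url, hrefs):
    """Variant using urljoin; absolute URLs pass through unchanged."""
    return [urljoin(base_url, h) for h in hrefs if not h.startswith('#')]


# Hypothetical hrefs standing in for what the scraper collects
sample_hrefs = ['#', '/offers', 'https://example.com/about', '/support']

result = normalize_links('https://www.swiggy.com', sample_hrefs)
result_joined = normalize_links_urljoin('https://www.swiggy.com/', sample_hrefs)

print(result)
print(result_joined)
```

Both calls give the same three URLs for this sample. Plain concatenation would produce a double slash if the base URL ended in '/', which is the main reason to consider urljoin.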

The full code is now:

import urllib.request
from bs4 import BeautifulSoup

def linkgetter(searchlink):
    pagesource = urllib.request.urlopen(searchlink)
    soup = BeautifulSoup(pagesource, 'lxml')

    # Collect the href of every <a> tag that has one
    linkaddresses = []
    for link in soup.find_all('a'):
        href = link.get('href')
        if href is not None:
            linkaddresses.append(href)
    print(linkaddresses)

    # Drop in-page anchors, then turn site-relative paths into full URLs
    linkaddresses = [x for x in linkaddresses if not x.startswith('#')]
    linkaddresses = [searchlink+x if x.startswith('/') else x for x in linkaddresses]

    print('\n')
    print(linkaddresses)

linkgetter('https://www.swiggy.com')
Rob P