I am using a list to store URLs, and I need to eliminate any URL that appears more than once, because I don't want to crawl the same URL again:

self.level = []  # list that holds the collected URLs
for link in self.soup.find_all('a'):
    self.level.append(link.get('href'))
    print(self.level)

I need to eliminate the duplicate URLs before crawling them.

falsetru
mans

1 Answer

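Maintain a set of URLs instead of a list. A set stores each element only once, so adding a URL that is already present has no effect: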

self.level = set()
for link in self.soup.find_all('a'):
    self.level.add(link.get('href'))
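
A set does not keep the order in which the links were found, and `link.get('href')` returns `None` for an anchor that has no `href` attribute. If either of those matters for the crawl, a minimal sketch along the following lines may help. `unique_links` is a hypothetical helper name, and the input is assumed to be a plain list of href values rather than the `soup` object from the question:

```python
def unique_links(hrefs):
    """Return hrefs without duplicates or None entries, in first-seen order."""
    seen = set()   # hrefs that have already been emitted
    result = []
    for href in hrefs:
        # Skip anchors without an href and anything already collected.
        if href is None or href in seen:
            continue
        seen.add(href)
        result.append(href)
    return result


# Illustrative input, not taken from a real page.
hrefs = ["/a", "/b", None, "/a", "/c", "/b"]
print(unique_links(hrefs))  # ['/a', '/b', '/c']
```

The set is still what performs the duplicate check; the list only records the order of first appearance.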
alecxe
  • @mans it is just that, by definition, a set is a collection of unique elements. – alecxe Dec 31 '14 at 05:58
  • Can you explain what the set() method is? – mans Dec 31 '14 at 05:59
  • @mans it is a way of initializing an empty set, see http://stackoverflow.com/questions/6130374/empty-set-literal-in-python. – alecxe Dec 31 '14 at 06:00
  • @mans: Please see the Python docs about sets: https://docs.python.org/3/tutorial/datastructures.html#sets and https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset (thanks to P_O_I_S_O_N for suggesting the 1st link). – PM 2Ring Dec 31 '14 at 06:20
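
To illustrate the two points made in the comments, here is a small sketch showing that `set()` creates an empty set (whereas `{}` creates an empty dict) and that adding the same value twice leaves only one copy. The example.com URLs are placeholders chosen for illustration:

```python
empty = set()    # an empty set
not_a_set = {}   # an empty dict, not a set

urls = set()
urls.add("http://example.com/a")
urls.add("http://example.com/b")
urls.add("http://example.com/a")  # already present, so nothing changes

print(type(empty).__name__, type(not_a_set).__name__)  # set dict
print(len(urls))  # 2
```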