
I want to scrape a list of links, but I cannot do this directly with BeautifulSoup because of the way the HTML is structured.

start_list = soup.find_all(href=re.compile('id='))

print(start_list)

[<a href="/movies/?id=actofvalor.htm"><b>Act of Valor</b></a>,
 <a href="/movies/?id=actionjackson.htm"><b>Action Jackson</b></a>]

I want to pull out just the href values. I am thinking of some sort of filter: put all of the bold tags into a list, then filter them out of another list that contains the output above.

start_list = soup.find_all('a', href=re.compile('id='))

start_list_soup = BeautifulSoup(str(start_list), 'html.parser')

things_to_remove = start_list_soup.find_all('b')

The idea is to loop through things_to_remove and remove all occurrences of its contents from start_list.

Chace Mcguyer

1 Answer

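import re

# soup is the BeautifulSoup object from the question
start_list = soup.find_all(href=re.compile('id='))
# Collect the href attribute of each matching tag
href_list = [i['href'] for i in start_list]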

href is an attribute of the tag. find_all returns a list of tags, so just iterate over it and use tag['href'] to access the attribute on each one.

To understand why [] is used, note that a tag's attributes are stored in a dictionary. From the documentation:

A tag may have any number of attributes. The tag <b class="boldest"> has an attribute “class” whose value is “boldest”. You can access a tag’s attributes by treating the tag like a dictionary:

tag['class']
# u'boldest'

You can access that dictionary directly as .attrs:

tag.attrs
# {u'class': u'boldest'}
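
Applied to the href case from the question, the dictionary-style access can be sketched as follows. This is a minimal illustration, not part of the quoted documentation: the second anchor without an href is a made-up example, included only to show that tag['href'] raises KeyError when the attribute is absent while tag.get('href') returns None.

```python
from bs4 import BeautifulSoup

# Sample markup: the first link is from the question, the second
# (an anchor with no href) is a hypothetical addition for illustration.
html = (
    '<a href="/movies/?id=actofvalor.htm"><b>Act of Valor</b></a>'
    '<a name="anchor-only">no link here</a>'
)
soup = BeautifulSoup(html, 'html.parser')
first, second = soup.find_all('a')

# Dictionary-style access returns the attribute value as a string.
first_href = first['href']

# .attrs exposes the underlying dictionary of attributes.
first_attrs = first.attrs

# .get() returns None instead of raising when the attribute is missing.
missing = second.get('href')

# Square-bracket access on a missing attribute raises KeyError.
try:
    second['href']
    raised = False
except KeyError:
    raised = True
```

If some of the matched tags might lack an href, tag.get('href') is the safer choice inside a loop or comprehension.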

The list comprehension is simple (see the PEP on list comprehensions for details). In this case, the same thing can be done with a plain for loop:

href_list = []
for i in start_list:
    href_list.append(i['href'])
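
For completeness, here is a self-contained sketch that runs end to end. The HTML string is an assumption reconstructed from the output shown in the question (the surrounding list markup and the extra /about.htm link are invented to show that the regex filter skips links without 'id='), and the titles variable is only an illustrative extra.

```python
import re
from bs4 import BeautifulSoup

# Assumed sample markup, modeled on the output printed in the question.
html = """
<ul>
  <li><a href="/movies/?id=actofvalor.htm"><b>Act of Valor</b></a></li>
  <li><a href="/movies/?id=actionjackson.htm"><b>Action Jackson</b></a></li>
  <li><a href="/about.htm">About</a></li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')

# Keep only <a> tags whose href contains 'id='.
start_list = soup.find_all('a', href=re.compile('id='))

# The nested <b> tags do not matter: href lives on the <a> tag itself.
href_list = [tag['href'] for tag in start_list]

# The visible text is available separately if it is needed.
titles = [tag.get_text() for tag in start_list]

print(href_list)
# ['/movies/?id=actofvalor.htm', '/movies/?id=actionjackson.htm']
```

This avoids re-parsing str(start_list) and stripping out the bold tags, since the attribute can be read directly from each tag.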
宏杰李