Removing unwanted html from an href tag in Python

Question

I want to scrape a list of links, but I cannot do this directly with BeautifulSoup because of the way the HTML is structured.

start_list = soup.find_all(href=re.compile('id='))

print(start_list)

[<a href="/movies/?id=actofvalor.htm"><b>Act of Valor</b></a>,
 <a href="/movies/?id=actionjackson.htm"><b>Action Jackson</b></a>]

I want to pull out just the href values. I was thinking of some sort of filter: put all of the bold tags into a list, then filter them out of another list that contains the output above.

start_list = soup.find_all('a', href=re.compile('id='))

start_list_soup = BeautifulSoup(str(start_list), 'html.parser')

things_to_remove = start_list_soup.find_all('b')

The idea is to loop through things_to_remove and remove every occurrence of its contents from start_list.

宏杰李 · Accepted Answer · 2017-01-02T03:32:11.753


start_list = soup.find_all(href=re.compile('id='))

href_list = [i['href'] for i in start_list]

href is an attribute of the tag. find_all returns a list of tags, so just iterate over it and use tag['href'] to access the attribute.
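Here is a minimal, self-contained sketch of the above, assuming markup shaped like the output in the question. The third link (/about.htm) is a hypothetical extra, added only to show that the regex filter leaves it out; the titles list is likewise an illustration, not something the question asked for.

```python
import re
from bs4 import BeautifulSoup

# Sample markup modeled on the question's output; the third link is a
# hypothetical extra added to show that the regex filter excludes it.
html = """
<a href="/movies/?id=actofvalor.htm"><b>Act of Valor</b></a>
<a href="/movies/?id=actionjackson.htm"><b>Action Jackson</b></a>
<a href="/about.htm">About</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only <a> tags whose href contains "id=".
start_list = soup.find_all("a", href=re.compile("id="))

# Read the attribute value from each tag. The nested <b> is part of the
# tag's contents, not of its href, so it never enters the result.
href_list = [a["href"] for a in start_list]

# The visible link text is available separately if it is needed.
titles = [a.get_text() for a in start_list]

print(href_list)
print(titles)
```

Because the href is read straight from each tag, there is no need to collect the b tags and strip them out of the list afterwards.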

To understand why [] is used, you should know that a tag's attributes are stored in a dictionary. From the documentation:

A tag may have any number of attributes. The tag <b class="boldest"> has an attribute “class” whose value is “boldest”. You can access a tag’s attributes by treating the tag like a dictionary:
tag['class']
# u'boldest'
You can access that dictionary directly as .attrs:
tag.attrs
# {u'class': u'boldest'}
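As a rough illustration of the dictionary-style access described in the quote, the sketch below uses hypothetical markup in which the second a tag deliberately has no href. It shows that tag['href'] raises KeyError when the attribute is missing, while tag.get('href') returns None, much like a plain dict.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <a> deliberately has no href attribute.
soup = BeautifulSoup(
    '<a href="/movies/?id=actofvalor.htm"><b>Act of Valor</b></a>'
    '<a name="top">Top</a>',
    "html.parser",
)
with_href, without_href = soup.find_all("a")

# Dictionary-style lookup and the underlying attribute dictionary.
print(with_href["href"])
print(with_href.attrs)

# .get() returns None instead of raising when the attribute is absent.
print(without_href.get("href"))

# Square brackets raise KeyError for a missing attribute.
try:
    without_href["href"]
except KeyError:
    missing_raises = True
else:
    missing_raises = False

# Skip tags that lack the attribute altogether.
safe_hrefs = [a["href"] for a in soup.find_all("a") if a.has_attr("href")]
print(safe_hrefs)
```

In the question's case every matched tag has an href (the find_all filter guarantees it), so plain square brackets are fine there.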

A list comprehension is simple (you can reference this PEP). In this case, the same thing can be done with a for loop:

href_list = []
for i in start_list:
    href_list.append(i['href'])
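To make the equivalence concrete, here is a small sketch that runs without BeautifulSoup. Plain dicts stand in for the tags (an assumption made purely for illustration; a real tag supports the same i['href'] lookup), and the valor_only filter is a hypothetical extra showing the optional if clause.

```python
# Plain dicts stand in for tags here (an assumption for illustration);
# a BeautifulSoup tag supports the same i['href'] lookup.
start_list = [
    {"href": "/movies/?id=actofvalor.htm"},
    {"href": "/movies/?id=actionjackson.htm"},
]

# General form: [expression for item in iterable]
# Here the expression is i['href'], evaluated once per item.
href_list = [i["href"] for i in start_list]

# The same thing written as an explicit loop.
href_loop = []
for i in start_list:
    href_loop.append(i["href"])

# An optional trailing condition keeps only the items that pass it.
valor_only = [i["href"] for i in start_list if "valor" in i["href"]]

print(href_list == href_loop)
print(valor_only)
```

The brackets around the whole expression build the new list; the brackets in i['href'] are just the attribute lookup on each item.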

edited Jan 02 '17 at 03:32

answered Jan 02 '17 at 02:53


This is exactly what I needed. Can you explain the list comprehension a little more to me? – Chace Mcguyer Jan 02 '17 at 03:09
Specifically, this part: i['href']. Why is it in brackets? – Chace Mcguyer Jan 02 '17 at 03:11
@Chace Mcguyer please accept this answer to close this question. – 宏杰李 Jan 02 '17 at 11:26
