
I want to get the href values from the result of a = soup.find_all('div', class_='email-messages'), which looks like this:

[<div class="email-messages">
<table>
<tr>
<td id="email-title">Message Title</td>
<td id="email-sender">Sender</td>
<td id="email-control">Control </td>
</tr>
<tr>
<td><a href="/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1">Fwd: [Microsoft Academic Verification] Confirming Your Academic Status</a></td>
<td id="email-sender"><span data-cf-modified-c9b86b506f187bfdc48368eb-="" onclick="if (!window.__cfRLUnblockHandlers) return false; show_sender_email(this, 'guidetuanhp@gmail.com')" style="cursor: pointer;">Tuấn Anh Vũ</span></td>
<td id="email-control"><a data-cf-modified-c9b86b506f187bfdc48368eb-="" href="/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1/delete" onclick="if (!window.__cfRLUnblockHandlers) return false; return delete_mail('/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1/delete');">[Delete]</a></td>
</tr>
<tr>
<td class="mail_message_counter" colspan="3">Total Messages: <strong>1</strong></td>
</tr>
</table>
</div>]

My code:

soup = BeautifulSoup(html_doc, 'lxml')
a = soup.find_all('div', class_='email-messages')
for link in a:
    print(link['href'])

I got this error:

in __getitem__
    return self.attrs[key]
KeyError: 'href'
martineau
guidetuanhp

2 Answers


For "single-purpose" scraping, it is useful to customize the parser with SoupStrainer. It is faster (or it should be!) since it restricts parsing to the portion of the document you want to scrape. See the "Parsing only part of a document" section of the Beautiful Soup documentation for details.

The SoupStrainer instance must be passed to the BeautifulSoup constructor as the parse_only keyword argument:

from bs4 import BeautifulSoup, SoupStrainer

# html_doc is the HTML string shown in the question above

soup = BeautifulSoup(html_doc, 'lxml', parse_only=SoupStrainer('a', href=True))
for tag in soup:
    print(tag['href'])

Output

/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1
/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1/delete

Remember

  1. The soup is "strained", so you are dealing with a BeautifulSoup object and not with a list. The loop variable is therefore a bs4.element.Tag object!
  2. SoupStrainer accepts the same arguments as the find_all method.
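
To illustrate the second point, here is a minimal sketch that passes the same filter arguments once to SoupStrainer and once to find_all. The HTML is a shortened, hypothetical stand-in for the question's markup (the paths and the extra anchor without an href are assumptions for illustration), and html.parser is used only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Shortened stand-in for the question's HTML (hypothetical paths).
html_doc = """<div class="email-messages"><table><tr>
<td><a href="/en/msg/1">Subject</a></td>
<td><a href="/en/msg/1/delete">[Delete]</a></td>
<td><a name="no-href">anchor without href</a></td>
</tr></table></div>"""

# The same filter ("a" tags that have an href attribute), used two ways.
only_links = SoupStrainer("a", href=True)
strained = BeautifulSoup(html_doc, "html.parser", parse_only=only_links)
full = BeautifulSoup(html_doc, "html.parser")

# The strained soup only ever contained the matching <a> tags.
strained_hrefs = [tag["href"] for tag in strained.find_all("a")]
# The full soup is filtered at search time instead of at parse time.
full_hrefs = [tag["href"] for tag in full.find_all("a", href=True)]

print(strained_hrefs)
print(full_hrefs)
```

Both lists should contain the same two paths. The difference is only in when the filtering happens: at parse time with the strainer, at search time with find_all.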
cards

You're trying to read "href" from the <div> tags, which don't have that attribute, hence the KeyError. Find all the <a> tags inside each <div> instead:

from bs4 import BeautifulSoup

html_doc = """<div class="email-messages">
<table>
<tr>
<td id="email-title">Message Title</td>
<td id="email-sender">Sender</td>
<td id="email-control">Control </td>
</tr>
<tr>
<td><a href="/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1">Fwd: [Microsoft Academic Verification] Confirming Your Academic Status</a></td>
<td id="email-sender"><span data-cf-modified-c9b86b506f187bfdc48368eb-="" onclick="if (!window.__cfRLUnblockHandlers) return false; show_sender_email(this, 'guidetuanhp@gmail.com')" style="cursor: pointer;">Tuấn Anh Vũ</span></td>
<td id="email-control"><a data-cf-modified-c9b86b506f187bfdc48368eb-="" href="/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1/delete" onclick="if (!window.__cfRLUnblockHandlers) return false; return delete_mail('/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1/delete');">[Delete]</a></td>
</tr>
<tr>
<td class="mail_message_counter" colspan="3">Total Messages: <strong>1</strong></td>
</tr>
</table>
</div>"""

soup = BeautifulSoup(html_doc, "html.parser")


divs = soup.find_all("div", class_="email-messages")
for div in divs:
    for link in div.find_all("a"):
        print(link["href"])

Prints:

/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1
/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1/delete
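
As a possible variation on the nested loop above, the same lookup can be sketched with a CSS selector, and Tag.get can be used where an attribute might be missing. The HTML here is a small hypothetical sample (not the question's markup) chosen to show an anchor without an href and a link outside the div.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one link with href, one without, one outside the div.
html_doc = """<div class="email-messages">
<a href="/en/msg/1">Subject</a>
<a>no href here</a>
</div>
<a href="/outside">outside the div</a>"""

soup = BeautifulSoup(html_doc, "html.parser")

# "a[href]" matches only anchors that carry an href attribute,
# and the "div.email-messages" prefix limits the search to that div.
hrefs = [a["href"] for a in soup.select("div.email-messages a[href]")]
print(hrefs)

# Tag.get returns None for a missing attribute instead of raising KeyError,
# which is what indexing the <div> with ["href"] did in the question.
div = soup.find("div", class_="email-messages")
missing = div.get("href")
print(missing)
```

Indexing with ["href"] is fine once the selector guarantees the attribute exists; .get is the safer choice when it may be absent.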
Andrej Kesely