If you only need to scrape specific lines, you need to extract those lines before you parse them. I'd suggest using str.splitlines()
and a list slice to get them.
For example:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://www.yahoo.com')
>>> print('\n'.join(r.text.splitlines()[575:634]))
The output is:
<li class="D(b)">
<a href="https://www.yahoo.com/politics/" class="D(b) Fz(13px) C($topbarMenu) Py(3px) Td(n) Td(u):h" data-ylk="slk:politics;t5:politics;cpos:9;" tabindex="1">Politics</a>
</li>
<li class="D(b)">
<a href="https://www.yahoo.com/celebrity/" class="D(b) Fz(13px) C($topbarMenu) Py(3px) Td(n) Td(u):h" data-ylk="slk:celebrity;t5:celebrity;cpos:10;" tabindex="1">Celebrity</a>
</li>
...
<li class="D(b)">
<a href="https://www.yahoo.com/travel/" class="D(b) Fz(13px) C($topbarMenu) Py(3px) Td(n) Td(u):h" data-ylk="slk:travel;t5:travel;cpos:22;" tabindex="1">Travel</a>
</li>
<li class="D(b)">
<a href="https://www.yahoo.com/autos/" class="D(b) Fz(13px) C($topbarMenu) Py(3px) Td(n) Td(u):h" data-ylk="slk:autos;t5:autos;cpos:23;" tabindex="1">Autos</a>
</li>
r.text.splitlines() splits the HTML source code into lines and gives a list.
[575:634] is a list slice. It returns the list items at indices 575 to 633, which are lines 576 to 634 of the page (list indices start at 0 and the end of a slice is excluded). I added two more lines because without them, the output would be:
<a href="https://www.yahoo.com/politics/" class="D(b) Fz(13px) C($topbarMenu) Py(3px) Td(n) Td(u):h" data-ylk="slk:politics;t5:politics;cpos:9;" tabindex="1">Politics</a>
</li>
<li class="D(b)">
<a href="https://www.yahoo.com/celebrity/" class="D(b) Fz(13px) C($topbarMenu) Py(3px) Td(n) Td(u):h" data-ylk="slk:celebrity;t5:celebrity;cpos:10;" tabindex="1">Celebrity</a>
</li>
...
<li class="D(b)">
<a href="https://www.yahoo.com/travel/" class="D(b) Fz(13px) C($topbarMenu) Py(3px) Td(n) Td(u):h" data-ylk="slk:travel;t5:travel;cpos:22;" tabindex="1">Travel</a>
</li>
<li class="D(b)">
<a href="https://www.yahoo.com/autos/" class="D(b) Fz(13px) C($topbarMenu) Py(3px) Td(n) Td(u):h" data-ylk="slk:autos;t5:autos;cpos:23;" tabindex="1">Autos</a>
That isn't a valid HTML code block, because the opening <li> at the start and the closing </li> at the end are missing.
'\n'.join() joins the list with \n and gives a single string, which is what you want.
Once we have the specific lines:
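To make the line-number arithmetic concrete, here is a minimal sketch on a small in-memory string instead of the live page. The helper name lines_between and the sample text are my own inventions for illustration; the only point is that lines first to last (counted from 1) correspond to the slice [first - 1:last].

```python
def lines_between(text, first, last):
    """Return lines first..last (1-based, inclusive) of text as one string."""
    # A slice is 0-based and excludes its end, so line `first` sits at
    # index first - 1, and stopping at index `last` still includes line `last`.
    return '\n'.join(text.splitlines()[first - 1:last])


# Hypothetical sample text standing in for r.text.
sample = "line 1\nline 2\nline 3\nline 4\nline 5"

picked = lines_between(sample, 2, 4)
print(picked)
```

By the same rule, lines_between(r.text, 576, 634) would give the same string as '\n'.join(r.text.splitlines()[575:634]).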
>>> soup = BeautifulSoup('\n'.join(r.text.splitlines()[575:634]), 'html.parser')
>>> for i in soup.find_all('a'):
...     print(i.get('href'))
...
https://www.yahoo.com/politics/
https://www.yahoo.com/celebrity/
https://www.yahoo.com/movies/
https://www.yahoo.com/music/
https://www.yahoo.com/tv/
https://www.yahoo.com/health/
https://www.yahoo.com/style/
https://www.yahoo.com/beauty/
https://www.yahoo.com/food/
https://www.yahoo.com/parenting/
https://www.yahoo.com/makers/
https://www.yahoo.com/tech/
https://shopping.yahoo.com/
https://www.yahoo.com/travel/
https://www.yahoo.com/autos/
soup.find_all('a') finds all the <a> HTML tags in the string (HTML code block) we have, and gives a list of these tags.
Then we use a for loop over the list, and use i.get('href') to get the href attribute (the link you want) of each <a> tag.
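If you want to try this without fetching the page, here is a small offline sketch. The HTML fragment is made up (it only imitates the structure of the output above), and the third <a> tag without an href is my own addition to show that .get('href') returns None rather than raising an error when the attribute is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the sliced lines of the page.
html = '''
<li><a href="https://www.yahoo.com/tv/">TV</a></li>
<li><a href="https://www.yahoo.com/food/">Food</a></li>
<li><a name="no-link">Placeholder</a></li>
'''

soup = BeautifulSoup(html, 'html.parser')

links = []
for a in soup.find_all('a'):   # every <a> tag in the fragment
    href = a.get('href')       # None when the tag has no href attribute
    if href is not None:
        links.append(href)

print(links)
```

Skipping None here is just one choice; if every <a> in your slice has an href, the check isn't needed.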
You can also use a list comprehension to put the result into a list, rather than print it out:
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.yahoo.com')
soup = BeautifulSoup('\n'.join(r.text.splitlines()[575:634]), 'html.parser')
l = [i.get('href') for i in soup.find_all('a')]
l is the list you're looking for.
If you also want to get the titles of these links, you can use i.text to get them. However, there's no table object in Python; I think you mean a dict:
>>> from pprint import pprint
>>> d = {i.text: i.get('href') for i in soup.find_all('a')}
>>> pprint(d)
{'Autos': 'https://www.yahoo.com/autos/',
 'Beauty': 'https://www.yahoo.com/beauty/',
 'Celebrity': 'https://www.yahoo.com/celebrity/',
 'Food': 'https://www.yahoo.com/food/',
 'Health': 'https://www.yahoo.com/health/',
 'Makers': 'https://www.yahoo.com/makers/',
 'Movies': 'https://www.yahoo.com/movies/',
 'Music': 'https://www.yahoo.com/music/',
 'Parenting': 'https://www.yahoo.com/parenting/',
 'Politics': 'https://www.yahoo.com/politics/',
 'Shopping': 'https://shopping.yahoo.com/',
 'Style': 'https://www.yahoo.com/style/',
 'TV': 'https://www.yahoo.com/tv/',
 'Tech': 'https://www.yahoo.com/tech/',
 'Travel': 'https://www.yahoo.com/travel/'}
>>> d['TV']
'https://www.yahoo.com/tv/'
>>> d['Food']
'https://www.yahoo.com/food/'
So you can use {i.text: i.get('href') for i in soup.find_all('a')} to get the dict you want.
In this case, i.text (the title) gives the keys of that dict, for example 'TV' and 'Food'.
And i.get('href') gives the values (the links), for example 'https://www.yahoo.com/tv/' and 'https://www.yahoo.com/food/'.
You can access a value with d[key], as in my code above.
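Here is a small self-contained sketch of the dict approach that you can run offline. The HTML fragment is made up, and the extra spaces around TV are my own addition to show why get_text(strip=True) can be a safer choice than i.text for the keys; 'Sports' is just a key I picked that isn't in the fragment.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; note the stray whitespace around "TV".
html = '''
<li><a href="https://www.yahoo.com/tv/"> TV </a></li>
<li><a href="https://www.yahoo.com/food/">Food</a></li>
'''

soup = BeautifulSoup(html, 'html.parser')

# get_text(strip=True) trims surrounding whitespace, so the key is 'TV'
# rather than ' TV '.
d = {a.get_text(strip=True): a.get('href') for a in soup.find_all('a')}

print(d['TV'])
# d['Sports'] would raise KeyError; d.get() returns None instead.
print(d.get('Sports'))
```

Keep in mind that if two links had the same title, the later one would overwrite the earlier one, because dict keys are unique.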