I am new to Python and trying to learn how to use BeautifulSoup to scrape a webpage. For starters, I was just using yahoo.com's HTML code:

view-source:https://www.yahoo.com/

I wanted to scrape the list of links that starts on line 577 and ends on line 633, get each link's URL and title, and put them in a table in Python.

So far, I have the following:

import requests
from bs4 import BeautifulSoup

myURL = "http://www.yahoo.com"
myPage = requests.get(myURL)

# Name the parser explicitly so the result does not depend on what is installed
yahoo = BeautifulSoup(myPage.content, "html.parser")

print(yahoo.prettify())

YahooList = yahoo.find('ul', class_="Pos(r) Miw(1000px) Pstart(9px) Lh(1.7) Reader-open_Op(0) mini-header_Op(0)")
print(YahooList)

I am unsure of how to proceed further from this. All the examples I am finding are for web scraping from tables but I am not finding much on how to do it on a list.

Does anyone have any suggestions?

Thanks, Nick

Nick Johnson
  • What do you mean by *I wanted to scrape the list of links starting on row 577 and ending at 633*? Do you mean *scrape all the links from line 577 of the HTML source code to line 633*? – Remi Guan Feb 09 '16 at 02:36
  • Yes, that's exactly what I mean. From the entire webpage, I just want to scrape those specific lines. My apologies if that was unclear in my post. – Nick Johnson Feb 09 '16 at 02:39
  • Variable names like `myURL, YahooList` are not Pythonic, PEP-8 recommended names would be `my_url`, `yahoo_list` etc. – smci Dec 29 '18 at 05:33

1 Answer

If you only need to scrape specific lines, you have to extract those lines before you parse them. I'd suggest using str.splitlines() and a list slice to get them.

For example:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://www.yahoo.com')
>>> print('\n'.join(r.text.splitlines()[575:634]))

The output is:

<li class="D(b)">
    <a href="https://www.yahoo.com/politics/" class="D(b) Fz(13px) C($topbarMenu) Py(3px) Td(n) Td(u):h"   data-ylk="slk:politics;t5:politics;cpos:9;" tabindex="1">Politics</a>
</li>

<li class="D(b)">
    <a href="https://www.yahoo.com/celebrity/" class="D(b) Fz(13px) C($topbarMenu) Py(3px) Td(n) Td(u):h"   data-ylk="slk:celebrity;t5:celebrity;cpos:10;" tabindex="1">Celebrity</a>
</li>

...

<li class="D(b)">
    <a href="https://www.yahoo.com/travel/" class="D(b) Fz(13px) C($topbarMenu) Py(3px) Td(n) Td(u):h"   data-ylk="slk:travel;t5:travel;cpos:22;" tabindex="1">Travel</a>
</li>

<li class="D(b)">
    <a href="https://www.yahoo.com/autos/" class="D(b) Fz(13px) C($topbarMenu) Py(3px) Td(n) Td(u):h"   data-ylk="slk:autos;t5:autos;cpos:23;" tabindex="1">Autos</a>
</li>
  • r.text.splitlines() splits the HTML source code into lines and returns them as a list.

  • [575:634] is a list slice. List indices start at 0 and the end index is excluded, so it returns lines 576 to 634. That is one extra line on each side of the range you asked for; without those two lines, the output would be:

        <a href="https://www.yahoo.com/politics/" class="D(b) Fz(13px) C($topbarMenu) Py(3px) Td(n) Td(u):h"   data-ylk="slk:politics;t5:politics;cpos:9;" tabindex="1">Politics</a>
    </li>
    
    <li class="D(b)">
        <a href="https://www.yahoo.com/celebrity/" class="D(b) Fz(13px) C($topbarMenu) Py(3px) Td(n) Td(u):h"   data-ylk="slk:celebrity;t5:celebrity;cpos:10;" tabindex="1">Celebrity</a>
    </li>
    
    ...
    
    <li class="D(b)">
        <a href="https://www.yahoo.com/travel/" class="D(b) Fz(13px) C($topbarMenu) Py(3px) Td(n) Td(u):h"   data-ylk="slk:travel;t5:travel;cpos:22;" tabindex="1">Travel</a>
    </li>
    
    <li class="D(b)">
        <a href="https://www.yahoo.com/autos/" class="D(b) Fz(13px) C($topbarMenu) Py(3px) Td(n) Td(u):h"   data-ylk="slk:autos;t5:autos;cpos:23;" tabindex="1">Autos</a>
    

    That isn't a complete HTML fragment: the opening <li> of the first item and the closing </li> of the last item are missing.

  • '\n'.join() joins the list with \n and returns the single string you want.
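The off-by-one between line numbers and list indices is easy to get wrong, so here is a minimal sketch of the same splitlines() / slice / join steps on a tiny string. The helper name extract_lines and the sample text are made up for illustration; they are not part of requests or BeautifulSoup.

```python
def extract_lines(text, first, last):
    """Return lines first..last (1-based, inclusive) of text as one string."""
    lines = text.splitlines()
    # Slices are 0-based and exclude the end index, so the
    # 1-based lines first..last correspond to lines[first - 1:last].
    return "\n".join(lines[first - 1:last])


# Hypothetical sample text standing in for r.text
sample = "line 1\nline 2\nline 3\nline 4\nline 5"
middle = extract_lines(sample, 2, 4)
print(middle)
```

With this helper, the [575:634] slice above would be written as extract_lines(r.text, 576, 634).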


Once we have the specific lines, we can parse them:

>>> soup = BeautifulSoup('\n'.join(r.text.splitlines()[575:634]), 'html.parser')
>>> for i in soup.find_all('a'):
...     print(i.get('href'))
...     
... 
https://www.yahoo.com/politics/
https://www.yahoo.com/celebrity/
https://www.yahoo.com/movies/
https://www.yahoo.com/music/
https://www.yahoo.com/tv/
https://www.yahoo.com/health/
https://www.yahoo.com/style/
https://www.yahoo.com/beauty/
https://www.yahoo.com/food/
https://www.yahoo.com/parenting/
https://www.yahoo.com/makers/
https://www.yahoo.com/tech/
https://shopping.yahoo.com/
https://www.yahoo.com/travel/
https://www.yahoo.com/autos/

soup.find_all('a') finds all the <a> HTML tags in the string (HTML code block) we have, and gives a list of these tags.

Then we loop over that list with a for loop and use i.get('href') to get the href attribute (the link you want) of each <a> tag.
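If you want to try this without hitting the network, the same loop can be run on a hard-coded snippet. This is only a sketch: the HTML below is a shortened copy of two of the items shown earlier, and the variable names html and links are my own.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the lines sliced out of the real page
html = """
<li class="D(b)">
    <a href="https://www.yahoo.com/politics/" tabindex="1">Politics</a>
</li>
<li class="D(b)">
    <a href="https://www.yahoo.com/celebrity/" tabindex="1">Celebrity</a>
</li>
"""

soup = BeautifulSoup(html, "html.parser")

links = []
for tag in soup.find_all("a"):    # every <a> tag, in document order
    links.append(tag.get("href"))  # value of the href attribute

print(links)
```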


You can also use a list comprehension to put the result into a list, rather than print it out:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.yahoo.com')
soup = BeautifulSoup('\n'.join(r.text.splitlines()[575:634]), 'html.parser')

l = [i.get('href') for i in soup.find_all('a')]

l is the list which you're looking for.
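One thing to watch for: i.get('href') returns None for an <a> tag that has no href attribute. The Yahoo lines above don't contain any such tags, but if your page might, passing href=True to find_all() keeps only the tags that have one. A small sketch on made-up HTML:

```python
from bs4 import BeautifulSoup

# Made-up snippet: the second <a> has no href attribute
html = '<a href="https://www.yahoo.com/tv/">TV</a><a name="top">no link</a>'
soup = BeautifulSoup(html, "html.parser")

# .get() returns None when the attribute is missing
all_hrefs = [a.get("href") for a in soup.find_all("a")]

# href=True matches only tags that have an href attribute
only_links = [a["href"] for a in soup.find_all("a", href=True)]

print(all_hrefs)
print(only_links)
```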


If you also want to get the title of each link, you can use i.text. However, there's no built-in table object in Python; I think you mean a dict:

>>> from pprint import pprint
>>> d = {i.text: i.get('href') for i in soup.find_all('a')}
>>> pprint(d)
{'Autos': 'https://www.yahoo.com/autos/',
 'Beauty': 'https://www.yahoo.com/beauty/',
 'Celebrity': 'https://www.yahoo.com/celebrity/',
 'Food': 'https://www.yahoo.com/food/',
 'Health': 'https://www.yahoo.com/health/',
 'Makers': 'https://www.yahoo.com/makers/',
 'Movies': 'https://www.yahoo.com/movies/',
 'Music': 'https://www.yahoo.com/music/',
 'Parenting': 'https://www.yahoo.com/parenting/',
 'Politics': 'https://www.yahoo.com/politics/',
 'Shopping': 'https://shopping.yahoo.com/',
 'Style': 'https://www.yahoo.com/style/',
 'TV': 'https://www.yahoo.com/tv/',
 'Tech': 'https://www.yahoo.com/tech/',
 'Travel': 'https://www.yahoo.com/travel/'}
>>> d['TV']
'https://www.yahoo.com/tv/'
>>> d['Food']
'https://www.yahoo.com/food/'

So you can use {i.text: i.get('href') for i in soup.find_all('a')} to get the dict you want.

In this case, the i.text values (the titles) are the keys of that dict, for example 'TV' and 'Food'.

The i.get('href') values (the links) are the dict's values, for example 'https://www.yahoo.com/tv/' and 'https://www.yahoo.com/food/'.

You can access a value with d[key], as in my code above.
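A dict keeps only one entry per key, so two links with the same title would collapse into one. If you would rather have something closer to a table, one option (a sketch, not the only way) is a list of (title, url) tuples, which preserves duplicates and the order on the page. The HTML here is again a shortened stand-in for the real lines, and rows and width are names I picked.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the lines sliced out of the real page
html = """
<li><a href="https://www.yahoo.com/politics/">Politics</a></li>
<li><a href="https://www.yahoo.com/celebrity/">Celebrity</a></li>
"""
soup = BeautifulSoup(html, "html.parser")

# One (title, url) row per link, in document order
rows = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]

# Print the rows as two aligned columns
width = max(len(title) for title, _ in rows)
for title, url in rows:
    print(f"{title:<{width}}  {url}")
```

You can still build the dict from it afterwards with dict(rows) if you need lookups by title.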

Remi Guan