I am beginning to learn the basics of webscraping with Python, but I am having a little trouble with my code. I am trying to scrape the weather from the front page of 'yahoo.com':

Weather

<div class="Ai(c) D(f) Jc(sb) Fz(13px) Py(0) Px(0)">

    <div class="D(f) Ai(c) Fld(c)">
        <span class="Fw(600) Fz(12px) Mb(10px) C($c-fuji-grey-n) Fz(1em)">Today</span>
        <i class="D(b) Bgr(nr) Bgz(ct) Bgp(c) Mb(10px) H(40px) W(40px) wafer-img-loaded" style="background-image: url(&quot;https://s.yimg.com/cv/apiv2/200510/w/l/fair_day.png&quot;);"></i>
    
    <div class="Fw(600) Fz(12px)">
        <span class="C($c-fuji-grey-n) Pend(5px) unit_F">74<span>°</span></span>
        <span class="C($c-fuji-grey-o) unit_F">59<span>°</span></span></div></div>
    
    <div class="D(f) Ai(c) Fld(c)">
        <span class="Fw(600) Fz(12px) Mb(10px) C($c-fuji-grey-n) Fz(1em)">Wed</span>
        <i class="D(b) Bgr(nr) Bgz(ct) Bgp(c) Mb(10px) H(40px) W(40px) wafer-img-loaded" style="background-image: url(&quot;https://s.yimg.com/cv/apiv2/200510/w/l/partly_cloudy_day.png&quot;);"></i>
        <span class="Hidden">Partly cloudy today with a high of 74 °F (23.3 °C) and a low of 51 °F (10.6 °C).</span>
        
        <div class="Fw(600) Fz(12px)">
            <span class="C($c-fuji-grey-n) Pend(5px) unit_F">74<span>°</span></span>
            <span class="C($c-fuji-grey-o) unit_F">51<span>°</span></span></div></div>
    
    <div class="D(f) Ai(c) Fld(c)"><span class="Fw(600) Fz(12px) Mb(10px) C($c-fuji-grey-n) Fz(1em)">Thu</span>
        <i class="D(b) Bgr(nr) Bgz(ct) Bgp(c) Mb(10px) H(40px) W(40px) wafer-img-loaded" style="background-image: url(&quot;https://s.yimg.com/cv/apiv2/200510/w/l/partly_cloudy_day.png&quot;);"></i>
        <span class="Hidden">Partly cloudy today with a high of 84 °F (28.9 °C) and a low of 51 °F (10.6 °C).</span>
        
        <div class="Fw(600) Fz(12px)">
            <span class="C($c-fuji-grey-n) Pend(5px) unit_F">84<span>°</span></span>
            <span class="C($c-fuji-grey-o) unit_F">51<span>°</span></span></div></div>
    
    <div class="D(f) Ai(c) Fld(c)">
        <span class="Fw(600) Fz(12px) Mb(10px) C($c-fuji-grey-n) Fz(1em)">Fri</span>
        <i class="D(b) Bgr(nr) Bgz(ct) Bgp(c) Mb(10px) H(40px) W(40px) wafer-img-loaded" style="background-image: url(&quot;https://s.yimg.com/cv/apiv2/200510/w/l/scattered_showers_day_night.png&quot;);"></i>
        <span class="Hidden">Scattered thunderstorms today with a high of 84 °F (28.9 °C) and a low of 65 °F (18.3 °C).  There is a 35% chance of precipitation.</span>
        
        <div class="Fw(600) Fz(12px)">
            <span class="C($c-fuji-grey-n) Pend(5px) unit_F">84<span>°</span></span>
            <span class="C($c-fuji-grey-o) unit_F">65<span>°</span></span></div></div></div>

This is the code I have come up with to try and pull this information:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.yahoo.com/')

soup = BeautifulSoup(r.content, 'html.parser')

weatherTable = soup.select_one("div.Ai(c).D(f).Jc(sb).Fz(13px).Py(0).Px(0)")

for row in weatherTable.select("div.D(f).Ai(c).Fld(c)"):
    day = row.select_one("span.Fw(600).Fz(12px).Mb(10px).C($c-fuji-grey-n).Fz(1em)").text
    dWeather = row.select_one("span.C($c-fuji-grey-n).Pend(5px).unit_F").text
    nWeather = row.select_one("span.C($c-fuji-grey-o).unit_F").text
    print(day, dWeather, nWeather)

When I try to run my code, I get the following error:

Traceback (most recent call last):
  File "C:\Users\smith\eclipse-workspace\Practice\src\DecodeWeb.py", line 9, in <module>
    weatherTable = soup.select_one("div.Ai(c).D(f).Jc(sb).Fz(13px).Py(0).Px(0)")
  File "C:\Users\smith\AppData\Local\Programs\Python\Python39\lib\bs4\element.py", line 1834, in select_one
    value = self.select(selector, namespaces, 1, **kwargs)
  File "C:\Users\smith\AppData\Local\Programs\Python\Python39\lib\bs4\element.py", line 1869, in select
    results = soupsieve.select(selector, self, namespaces, limit, **kwargs)
  File "C:\Users\smith\AppData\Local\Programs\Python\Python39\lib\soupsieve\__init__.py", line 98, in select
    return compile(select, namespaces, flags, **kwargs).select(tag, limit)
  File "C:\Users\smith\AppData\Local\Programs\Python\Python39\lib\soupsieve\__init__.py", line 62, in compile
    return cp._cached_css_compile(pattern, namespaces, custom, flags)
  File "C:\Users\smith\AppData\Local\Programs\Python\Python39\lib\soupsieve\css_parser.py", line 211, in _cached_css_compile
    CSSParser(pattern, custom=custom_selectors, flags=flags).process_selectors(),
  File "C:\Users\smith\AppData\Local\Programs\Python\Python39\lib\soupsieve\css_parser.py", line 1058, in process_selectors
    return self.parse_selectors(self.selector_iter(self.pattern), index, flags)
  File "C:\Users\smith\AppData\Local\Programs\Python\Python39\lib\soupsieve\css_parser.py", line 909, in parse_selectors
    key, m = next(iselector)
  File "C:\Users\smith\AppData\Local\Programs\Python\Python39\lib\soupsieve\css_parser.py", line 1051, in selector_iter
    raise SelectorSyntaxError(msg, self.pattern, index)
soupsieve.util.SelectorSyntaxError: Invalid character '(' position 6
  line 1:
div.Ai(c).D(f).Jc(sb).Fz(13px).Py(0).Px(0)

Do I have to substitute the special characters so that BS4 can read the class names?

2 Answers

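The problem is that your CSS selectors include parentheses () and dollar signs $. These characters are not valid in an unescaped CSS identifier: parentheses delimit the arguments of functional pseudo-classes such as :not(), and $ appears in the [attr$=value] operator.

You can escape these characters using a backslash \.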

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.yahoo.com/')

soup = BeautifulSoup(r.content, 'html.parser')
# Raw strings (r"...") pass the backslashes through to the CSS parser unchanged
# and avoid Python's "invalid escape sequence" warning.
weatherTable = soup.select_one(r"div.Ai\(c\).D\(f\).Jc\(sb\).Fz\(13px\).Py\(0\).Px\(0\)")

for row in weatherTable.select(r"div.D\(f\).Ai\(c\).Fld\(c\)"):
    day = row.select_one(r"span.Fw\(600\).Fz\(12px\).Mb\(10px\).C\(\$c-fuji-grey-n\).Fz\(1em\)").text
    dWeather = row.select_one(r"span.C\(\$c-fuji-grey-n\).Pend\(5px\).unit_F").text
    nWeather = row.select_one(r"span.C\(\$c-fuji-grey-o\).unit_F").text
    print(day, dWeather, nWeather)

Output:

Today 82° 67°
Wed 78° 63°
Thu 78° 59°
Fri 81° 62°
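If typing the backslashes by hand gets tedious, the escaping can be generated. The following is only a sketch: class_selector is a hypothetical helper (not part of BeautifulSoup), and it assumes every character other than letters, digits, underscore and hyphen can simply be prefixed with a backslash. It runs against a trimmed copy of the markup from the question, so it does not depend on the live page.

```python
import re
from bs4 import BeautifulSoup

# Trimmed-down copy of one row from the question's markup (&deg; keeps the file ASCII).
HTML = """
<div class="D(f) Ai(c) Fld(c)">
  <span class="Fw(600) Fz(12px)">Today</span>
  <div class="Fw(600) Fz(12px)">
    <span class="C($c-fuji-grey-n) Pend(5px) unit_F">74<span>&deg;</span></span>
    <span class="C($c-fuji-grey-o) unit_F">59<span>&deg;</span></span>
  </div>
</div>
"""


def class_selector(tag, *classes):
    """Hypothetical helper: build 'tag.cls1.cls2' with special characters escaped."""
    # Prefix every character that is not a word character or hyphen with a backslash.
    escaped = (re.sub(r"([^\w-])", r"\\\1", name) for name in classes)
    return tag + "".join("." + name for name in escaped)


soup = BeautifulSoup(HTML, "html.parser")
row_selector = class_selector("div", "D(f)", "Ai(c)", "Fld(c)")
high_selector = class_selector("span", "C($c-fuji-grey-n)", "unit_F")

row = soup.select_one(row_selector)
high = row.select_one(high_selector).text

print(row_selector)  # the generated selector, backslashes included
print(high)          # text of the daytime temperature span
```

The helper does not handle class names that start with a digit, which need a different kind of escape in CSS.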

An alternative to escaping these characters with a backslash is to use an [attribute=value] selector, since a quoted attribute value needs no escaping.

For example, instead of doing:

day = row.select_one(
        r"span.Fw\(600\).Fz\(12px\).Mb\(10px\).C\(\$c-fuji-grey-n\).Fz\(1em\)"
    ).text

You can do:

day = row.select_one(
    # Find a `span` with the class-name `Fw(600) Fz(12px) Mb(10px) C($c-fuji-grey-n) Fz(1em)`
    'span[class="Fw(600) Fz(12px) Mb(10px) C($c-fuji-grey-n) Fz(1em)"]'
).text

which is much more readable.
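One caveat with [class="..."]: it compares the whole attribute string, so the classes have to appear in exactly that order. A small sketch on a cut-down copy of the question's markup shows the difference. It also shows two options that match a single class regardless of order: the [class~="..."] operator and BeautifulSoup's class_ argument. The HTML string here is an illustration, not the live Yahoo page.

```python
from bs4 import BeautifulSoup

# Cut-down copy of the two temperature spans from the question.
HTML = """
<div>
  <span class="C($c-fuji-grey-n) Pend(5px) unit_F">74</span>
  <span class="C($c-fuji-grey-o) unit_F">59</span>
</div>
"""
soup = BeautifulSoup(HTML, "html.parser")

# Exact match on the full class string: finds the second span.
exact = soup.select_one('span[class="C($c-fuji-grey-o) unit_F"]')

# Same classes in a different order: no match, select_one returns None.
reordered = soup.select_one('span[class="unit_F C($c-fuji-grey-o)"]')

# ~= matches one whitespace-separated class, wherever it appears in the attribute.
one_class = soup.select_one('span[class~="C($c-fuji-grey-o)"]')

# find_all with class_ also matches on a single class and needs no escaping.
by_find = soup.find_all("span", class_="unit_F")

print(exact.text)
print(reordered)
print(one_class.text)
print([s.text for s in by_find])
```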

MendelG

Whilst the reason for the failure is explained already (regarding escaping certain characters), those classes are dynamic and so a scrape based on that is likely to break fairly soon. Consider instead using more stable elements/attributes and their relationships to do the scrape:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.yahoo.com/')
soup = BeautifulSoup(r.content, 'html.parser')
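# Take every span inside the weather card except those that are the 3rd, 6th, ...
# child of their parent, then drop the spans that hold only a degree sign.
# '\u00b0' is the degree sign written as an escape, so the file stays pure ASCII
# and runs even if the editor does not save it as UTF-8.
spans = soup.select('.weather-card-content span:not(:nth-child(3n))')
data = [i.get_text() for i in spans if i.text != '\u00b0']
# data is a flat list (day, high, low, day, high, low, ...); regroup it in threes.
print(list(zip(data[0::3], data[1::3], data[2::3])))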
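To see what the selector and the slicing do without hitting the live page, here is a sketch on static HTML. The weather-card-content class name comes from the code above. The structure inside it is an assumption, modelled on the markup in the question, where the hidden description is the third child of each day's div. :nth-child(3n) matches elements that are the 3rd, 6th, ... child of their parent, so :not(:nth-child(3n)) skips that hidden span. The degree sign is written as '\u00b0', which also avoids the "Non-UTF-8 code" error that appears when the file is saved in another encoding.

```python
from bs4 import BeautifulSoup

# Assumed structure: two days, the second with a hidden description as 3rd child.
HTML = """
<div class="weather-card-content">
  <div>
    <span>Today</span><i></i>
    <div><span>74<span>&deg;</span></span><span>59<span>&deg;</span></span></div>
  </div>
  <div>
    <span>Wed</span><i></i><span class="Hidden">Partly cloudy</span>
    <div><span>74<span>&deg;</span></span><span>51<span>&deg;</span></span></div>
  </div>
</div>
"""
soup = BeautifulSoup(HTML, "html.parser")

# All descendant spans of the card, minus any span that is a 3rd/6th/... child.
spans = soup.select(".weather-card-content span:not(:nth-child(3n))")

# The nested spans contain only the degree sign; filter those out.
data = [s.get_text() for s in spans if s.text != "\u00b0"]

# data is flat: day, high, low, day, high, low. Slice with step 3 and zip.
rows = list(zip(data[0::3], data[1::3], data[2::3]))
for day, high, low in rows:
    print(day, high, low)
```

If the real page nests its spans differently, the 3n step and the slices would need adjusting to match.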
QHarr
  • Can you elaborate on this a little bit? Where did "soup.select('.weather-card-content span:not(:nth-child(3n))')" come from? I see the "weather-card-content", but not the "span:not(:nth-child(3n))". I also get an error when running this: "Non-UTF-8 code starting with '\xb0' on line 6, but no encoding declared;" – legendaryxv2 Jun 16 '21 at 01:05