
I am new to Python and web scraping, and this is my first ever question on Stack Overflow. I watched several tutorials and then tried to extract data from the table on this page: https://www.wunderground.com/hourly/ir/tehran/date/2021-04-14.

The table: (screenshot of the hourly forecast table on that page)

But the problem is that I can't seem to access the right class in the Scrapy shell. This is my spider:

import scrapy


class SpSpider(scrapy.Spider):
    name = 'sp'
    start_urls = ['https://www.wunderground.com/hourly/ir/tehran/date/2021-04-14/']

    def parse(self, response):
        time = response.css('span.ng-star-inserted').extract()
        yield {'time': time}

And this is what I get in the terminal:

In [4]: response.css('span.ng-star-inserted::text').extract()


Out[4]: 
['\xa0',
 'F',
 'Night',
 '\xa0',
 'in',
 '\xa0',
 'miles',
 '\xa0',
 'F',
 '\xa0',
 '%',
 '\xa0',
 'in',
 '\xa0',
 'in']

I wrote this with the purpose of getting just a single value (here 12, which is the first time shown in the table). But as you can see, the list contents are not relevant. How should I access the data?

P.S.: I am working with Python 3.8.


2 Answers


This may be a bit complicated for a beginner, but never mind.

The data you are looking for is loaded through an XHR request (F12 -> Network -> XHR). The request you make only returns the HTML tags that will later contain the data.

In the following code, the URL I used was taken from the XHR tab. I make the query on this URL, which returns a JSON response. Then I transform this JSON response (easily held in a Python dictionary) into a Pandas dataframe.

Note that the response returned by the query contains "all" the hourly forecasts for the available days (equivalent to clicking the left and right arrows on the web page).

import requests as rq 
import pandas as pd

headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0"}
url = "https://api.weather.com/v3/wx/forecast/hourly/15day?apiKey=6532d6454b8aa370768e63d6ba5a832e&geocode=35.696,51.401&units=e&language=en-US&format=json"
resp = rq.get(url,  headers=headers).json()

resp.keys() ## inspect the available keys

df = pd.DataFrame.from_dict(resp) # JSON to DF
df["validTimeLocal"] = pd.to_datetime(df["validTimeLocal"], infer_datetime_format=True) # object type to datetime type
df.sort_values(["validTimeLocal"], ascending=True, inplace=True) # sort the df by datetimes

sub_df = df[["validTimeLocal", "temperature", "precipChance"]] # select variables you want
print(sub_df.iloc[20:25]) ## print some, and compare to the website
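
If you only want the hours for one particular date (2021-04-14 from the question), one rough sketch, building on the dataframe above, is to filter on the date part of validTimeLocal:

# Sketch: keep only the rows for the date asked about in the question
target_date = pd.Timestamp("2021-04-14").date()
day_df = sub_df[sub_df["validTimeLocal"].dt.date == target_date]
print(day_df)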

Do some research on the terms above (XHR, JSON, dataframe) to progress. Also take a look at the requests and bs4 packages.

Note: the URL contains arguments specific to your search for Tehran (the geocode, etc.).
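
For example, you could rebuild the same call with requests' params argument and swap in another location; the sketch below only splits up the query string already shown above, and the coordinates are the Tehran ones that you would replace:

# Sketch: same API call as above, with the query string built from a dict
base = "https://api.weather.com/v3/wx/forecast/hourly/15day"
params = {
    "apiKey": "6532d6454b8aa370768e63d6ba5a832e",  # key taken from the URL above
    "geocode": "35.696,51.401",                    # replace with the lat,lon you need
    "units": "e",
    "language": "en-US",
    "format": "json",
}
resp = rq.get(base, params=params, headers=headers).json()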


To get the first time value, if that is all you need, use this CSS locator:

.mat-row:nth-of-type(1)>.cdk-column-timeHour>span

Second:

.mat-row:nth-of-type(2)>.cdk-column-timeHour>span

And so on.
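
If the table rows are actually present in the HTML that Scrapy receives (this page renders them with JavaScript, so you may need a JS renderer or the API approach from the other answer), a minimal sketch of using that locator inside the spider could look like:

def parse(self, response):
    # first row's time cell; ::text extracts the text node itself
    first_time = response.css(
        '.mat-row:nth-of-type(1)>.cdk-column-timeHour>span::text'
    ).get()
    yield {'time': first_time}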
