
I got this error while scraping:

UnboundLocalError: local variable 'tag' referenced before assignment

and it seems caused by

---> 17 return tag.select_one(".b-plainlist__date").text, tag.select_one(".b-plainlist__title").text, tag.find_next(class_="b-plainlist__announce").text.strip()

The code I am using is the following:

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

daterange = pd.date_range('02-25-2015', '09-16-2020', freq='D')

def main(req, date):
    r = req.get(f"website/{date.strftime('%Y%m%d')}")
    soup = BeautifulSoup(r.content, 'html.parser')
    for tag in soup.select(".b-plainlist "):
        print(tag.select_one(".b-plainlist__date").text)
        print(tag.select_one(".b-plainlist__title").text)
        print(tag.find_next(class_="b-plainlist__announce").text.strip())
    
    return tag.select_one(".b-plainlist__date").text, tag.select_one(".b-plainlist__title").text, tag.find_next(class_="b-plainlist__announce").text.strip()


with ThreadPoolExecutor(max_workers=30) as executor:
    with requests.Session() as req:
        fs = [executor.submit(main, req, date) for date in daterange]
        allin = []
        for f in fs:
            allin.append(f.result()) # the problem should be from here
        df = pd.DataFrame.from_records(
            allin, columns=["Date", "Title", "Content"])
   

I tried to apply some changes like in this post: UnboundLocalError: local variable 'text' referenced before assignment, but I don't think I have fully understood how to fix it.

Update: this is the response from the website and the output of print(soup.select("b-plainlist")):

<Response [503]> b'... HTTP 503 ... 503 Error  Service Unavailable ...
Try accessing the website it.sputniknews.com in a few minutes.
If the error repeats several times, contact the site administration.
IP: 107.181.177.10
Request: GET L3BvbGl0aWNhLzIwMTUwMzA4
Guru meditation: MGV1SjNTaWhuUHNiblJYVU96QVpxMDB6N1hDNjU5NTU= ...'

1 Answer


Try declaring tag = None just before your for loop, as follows:

def main(req, date):
    r = req.get(f"website/{date.strftime('%Y%m%d')}")
    soup = BeautifulSoup(r.content, 'html.parser')
    tag = None
    for tag in soup.select(".b-plainlist "):

The error occurs because control never enters the loop, so the variable 'tag' is never assigned. When you then try to return tag.select_one(".b-plainlist__date"), Python raises an UnboundLocalError.
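The empty-loop case can be reproduced without any scraping. A minimal sketch (the function names are made up for illustration):

```python
def last_without_guard(items):
    # 'item' is only bound if the loop body runs at least once
    for item in items:
        pass
    return item  # UnboundLocalError when 'items' is empty

def last_with_guard(items):
    item = None  # bound even when the loop never runs
    for item in items:
        pass
    return item

try:
    last_without_guard([])
except UnboundLocalError:
    print("no iterations, so 'item' was never assigned")

print(last_with_guard([]))         # None
print(last_with_guard([1, 2, 3]))  # 3
```

Note that with tag = None the return line in main still needs a check (e.g. return early when tag is None), otherwise you trade the UnboundLocalError for an AttributeError on None.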

rahul1205
  • I think this error is caused by that None. Something does not work when I run the for loop and append the results to allin –  Nov 04 '20 at 17:22
  • Are you able to print the response from the website? Please do a print(r, r.content) right above soup=BeautifulSoup(...). I want to make sure you are actually getting a response. – rahul1205 Nov 04 '20 at 17:49
  • While you are at it, can you please also print soup.select("b-plainlist")? I quickly glanced through the HTML you provided and am not able to see any tags with the above-mentioned class. – rahul1205 Nov 04 '20 at 17:59
  • I am not sure the classes I have used are right, so they might be wrong, though I can see the outputs. This is the response: b' . For the other print, please see the question. Thanks for your help –  Nov 04 '20 at 20:41
  • Response code 503 means Service Unavailable, so this is not a BeautifulSoup error; something is wrong with the request you are making with the 'requests' module. Ideally, a default GET request should return a 200. I would suggest calling the URL with Postman and debugging what is wrong with your request – rahul1205 Nov 05 '20 at 23:00
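Since the server answers 503, the parsing code never sees real HTML, so it helps to check the status code before calling BeautifulSoup. A minimal sketch of that guard; the FakeSession/FakeResponse classes are stand-ins so the sketch runs offline, and the retry/backoff numbers are illustrative (a real run would pass a requests.Session instead):

```python
import time

def fetch_html(session, url, retries=3, backoff=1):
    """Return page content only on HTTP 200; retry on errors, else give up."""
    for attempt in range(retries):
        r = session.get(url)
        if r.status_code == 200:
            return r.content
        time.sleep(backoff * attempt)  # 0s, 1s, 2s, ... simple linear backoff
    return None  # caller decides how to handle a missing page

# Stand-ins for requests.Session/Response so the sketch runs offline:
class FakeResponse:
    def __init__(self, status_code, content):
        self.status_code = status_code
        self.content = content

class FakeSession:
    def __init__(self, responses):
        self._responses = iter(responses)
    def get(self, url):
        return next(self._responses)

session = FakeSession([FakeResponse(503, b""), FakeResponse(200, b"<html>ok</html>")])
print(fetch_html(session, "website/20150308", backoff=0))  # b'<html>ok</html>'
```

With this guard, main can return None for a failed date and the caller can filter those out before building the DataFrame.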