Web scraping with BeautifulSoup (403 error)

Question

I've been working on a project that isn't working as intended, and I hope someone here can help. I have only a basic understanding of Python, so I would really appreciate any help.

The project uses Python and yfinance to extract some stock data, and web scraping to extract Tesla's quarterly revenue from a table on a website. The problem is in part 2, when I download the URL as text to be parsed by BeautifulSoup, and again when I try to remove the commas and dollar signs.

I get a 403 error when I print(soup). It appears I'm being blocked by the website, but it seemed to have worked before. Am I wrong? Is there another way to scrape the website without getting the error?

Install the packages

    !pip install yfinance
    !pip install bs4

Import the libraries

    import yfinance as yf
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    import plotly.graph_objects as go
    from plotly.subplots import make_subplots

Define the URL and download the page as text

    url = "https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue"
    html_data = requests.get(url).text

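Parse the HTML data using BeautifulSoup

    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(html_data, "html.parser")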

    print(soup)

    <?xml version="1.0" encoding="utf-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

    403 Forbidden

    <h1>Error 403 Forbidden</h1>
    <p>Forbidden</p>
    <h3>Error 54113</h3>
    <p>Details: cache-nrt-rjtf7700062-NRT 1692742966 2184915405</p>
    <hr/>
    <p>Varnish cache server</p>
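One way to catch a block like this before parsing is to check the response status first. The sketch below is illustrative only: `check_response` is a hypothetical helper (not part of requests), and it relies only on the `status_code` and `text` attributes that a `requests.Response` object provides. A stand-in object is used so the example runs without network access.

```python
from types import SimpleNamespace


def check_response(response):
    """Return the body text, or raise if the server did not answer 200 OK.

    `response` is any object with `status_code` and `text` attributes,
    such as the object returned by requests.get(). Hypothetical helper.
    """
    if response.status_code != 200:
        raise RuntimeError(
            f"Request failed with HTTP {response.status_code}; "
            "the body is an error page, not the revenue table."
        )
    return response.text


# Stand-in for a blocked response, so the sketch runs offline.
blocked = SimpleNamespace(status_code=403, text="<h1>Error 403 Forbidden</h1>")

try:
    html_data = check_response(blocked)
except RuntimeError as err:
    html_data = None
    message = str(err)
    print(message)
```

With requests, `check_response(requests.get(url, timeout=10))` would play the same role, and `response.raise_for_status()` is the library's built-in way to turn a 4xx or 5xx status into an exception.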

Then I try looking for the table entitled "Tesla Quarterly Revenue" with two columns for date and price.

    data = []
    for table in soup.find_all("table"):
        # Only process the table whose header mentions the title
        if any("Tesla Quarterly Revenue".lower() in th.text.lower() for th in table.find_all("th")):
            for row in table.find("tbody").find_all("tr"):
                # Each row is expected to hold exactly two cells: date and revenue
                date_col, rev_col = row.find_all("td")
                data.append({
                    "Date": date_col.text,
                    "Revenue": rev_col.text
                })

    tesla_revenue = pd.DataFrame(data)
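To see how this loop behaves on an error page, the same lookup can be run against small hand-written HTML strings. This is a minimal sketch, assuming bs4 is installed: `extract_rows` is a hypothetical helper name, and both HTML snippets are made up for illustration, so the real page layout may differ.

```python
from bs4 import BeautifulSoup


def extract_rows(html, title="Tesla Quarterly Revenue"):
    """Collect date/revenue text from the table whose header contains `title`."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.find_all("table"):
        headers = [th.get_text() for th in table.find_all("th")]
        if not any(title.lower() in h.lower() for h in headers):
            continue
        # Fall back to the table itself when there is no <tbody>.
        body = table.find("tbody") or table
        for tr in body.find_all("tr"):
            cells = tr.find_all("td")
            if len(cells) == 2:  # skip header rows and malformed rows
                rows.append({
                    "Date": cells[0].get_text(strip=True),
                    "Revenue": cells[1].get_text(strip=True),
                })
    return rows


# Made-up markup for illustration only (not real data).
sample_html = """
<table>
  <thead><tr><th colspan="2">Tesla Quarterly Revenue</th></tr></thead>
  <tbody><tr><td>2000-01-01</td><td>$1,234</td></tr></tbody>
</table>
"""
error_html = "<h1>Error 403 Forbidden</h1><p>Forbidden</p>"

print(extract_rows(sample_html))
print(extract_rows(error_html))
```

The error page contains no `<table>` at all, so the loop body never runs and the result is an empty list.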

Remove the comma and dollar sign

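    # regex=True is needed in recent pandas versions, where str.replace treats the pattern literally by default
    tesla_revenue["Revenue"] = tesla_revenue["Revenue"].str.replace(r",|\$", "", regex=True)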

    KeyError                                  Traceback (most recent call last)
    File ~\anaconda3\Lib\site-packages\pandas\core\indexes\base.py:3802, in Index.get_loc(self, key, method, tolerance)
       3801 try:
    -> 3802     return self._engine.get_loc(casted_key)
       3803 except KeyError as err:

    File ~\anaconda3\Lib\site-packages\pandas\_libs\index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()

    File ~\anaconda3\Lib\site-packages\pandas\_libs\index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()

    File pandas\_libs\hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()

    File pandas\_libs\hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()

    KeyError: 'Revenue'

    The above exception was the direct cause of the following exception:

    KeyError                                  Traceback (most recent call last)
    Cell In[12], line 1
    ----> 1 tesla_revenue["Revenue"] = tesla_revenue['Revenue'].str.replace(',|\$',"")

    File ~\anaconda3\Lib\site-packages\pandas\core\frame.py:3807, in DataFrame.__getitem__(self, key)
       3805 if self.columns.nlevels > 1:
       3806     return self._getitem_multilevel(key)
    -> 3807 indexer = self.columns.get_loc(key)
       3808 if is_integer(indexer):
       3809     indexer = [indexer]

    File ~\anaconda3\Lib\site-packages\pandas\core\indexes\base.py:3804, in Index.get_loc(self, key, method, tolerance)
       3802     return self._engine.get_loc(casted_key)
       3803 except KeyError as err:
    -> 3804     raise KeyError(key) from err
       3805 except TypeError:
       3806     # If we have a listlike key, _check_indexing_error will raise
       3807     #  InvalidIndexError. Otherwise we fall through and re-raise
       3808     #  the TypeError.
       3809     self._check_indexing_error(key)

    KeyError: 'Revenue'
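The KeyError is consistent with `data` staying empty: a DataFrame built from an empty list has no columns, so `"Revenue"` cannot be looked up. Here is a minimal sketch, assuming pandas is installed. The sample row is made up for illustration and is not real Tesla data.

```python
import pandas as pd

# With no rows collected, the DataFrame has no columns at all,
# so indexing it with "Revenue" would raise KeyError.
empty = pd.DataFrame([])
print("Revenue" in empty.columns)

# Naming the columns up front keeps the later cleanup from raising KeyError,
# even when nothing was scraped.
safe = pd.DataFrame([], columns=["Date", "Revenue"])
safe["Revenue"] = safe["Revenue"].str.replace(r",|\$", "", regex=True)

# Made-up sample row (not real data) to show the cleanup itself.
sample = pd.DataFrame([{"Date": "2000-01-01", "Revenue": "$1,234"}])
sample["Revenue"] = sample["Revenue"].str.replace(r",|\$", "", regex=True)
print(sample["Revenue"].tolist())
```

Declaring the columns only hides the symptom. The table is still empty until the request itself succeeds.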

It just appears that I'm being blocked by the website so no data is being passed along. Is this correct? Any suggestions?

"Then I try looking for the table entitled "Tesla Quarterly Revenue" with two columns for date and price." - how exactly are you expecting to find them, in the HTML output that you show? I certainly don't see anything like that, I just see a 403 error. "It just appears that I'm being blocked by the website so no data is being passed along. Is this correct?" Yes, that's exactly what it means. "Any suggestions?" Have you tried reading the webpage (as in, with an ordinary web browser, and looking around the About section etc.), in order to understand their scraping policy? — Karl Knechtel, Aug 23 '23 at 03:08
(Have you considered that some website owners *actively do not want you* to scrape their website? Have you tried checking if there is an API you can use instead?) — Karl Knechtel, Aug 23 '23 at 03:09
