
I have been trying to web-scrape two websites for data but am facing issues. I would be extremely glad if anyone can help resolve the problem.

1. https://online.capitalcube.com/ The website requires you to log in. I came up with the following code after watching tutorials on YouTube for the last two days.

from bs4 import BeautifulSoup
import pandas as pd
import requests

URL = 'https://online.capitalcube.com/'
LOGIN_ROUTE = '/login'

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:93.0) Gecko/20100101 Firefox/93.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'TE': 'trailers',
}

# Reuse one session so cookies set by the login response persist
s = requests.Session()

login_payload = {
    'email': '<intentionally removed it>',
    'password': '<intentionally removed it>'
}

login_req = s.post(URL + LOGIN_ROUTE, headers = headers, data = login_payload)

print(login_req.status_code)

The error I am getting is as follows:

Traceback (most recent call last):
  File "/Users/rafatsiddiqui/opt/anaconda3/envs/scientificProject/lib/python3.9/site-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/Users/rafatsiddiqui/opt/anaconda3/envs/scientificProject/lib/python3.9/site-packages/urllib3/connectionpool.py", line 382, in _make_request
    self._validate_conn(conn)
  File "/Users/rafatsiddiqui/opt/anaconda3/envs/scientificProject/lib/python3.9/site-packages/urllib3/connectionpool.py", line 1010, in _validate_conn
    conn.connect()
  File "/Users/rafatsiddiqui/opt/anaconda3/envs/scientificProject/lib/python3.9/site-packages/urllib3/connection.py", line 416, in connect
    self.sock = ssl_wrap_socket(
  File "/Users/rafatsiddiqui/opt/anaconda3/envs/scientificProject/lib/python3.9/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/Users/rafatsiddiqui/opt/anaconda3/envs/scientificProject/lib/python3.9/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/Users/rafatsiddiqui/opt/anaconda3/envs/scientificProject/lib/python3.9/ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "/Users/rafatsiddiqui/opt/anaconda3/envs/scientificProject/lib/python3.9/ssl.py", line 1040, in _create
    self.do_handshake()
  File "/Users/rafatsiddiqui/opt/anaconda3/envs/scientificProject/lib/python3.9/ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/rafatsiddiqui/opt/anaconda3/envs/scientificProject/lib/python3.9/site-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/Users/rafatsiddiqui/opt/anaconda3/envs/scientificProject/lib/python3.9/site-packages/urllib3/connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "/Users/rafatsiddiqui/opt/anaconda3/envs/scientificProject/lib/python3.9/site-packages/urllib3/util/retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='online.capitalcube.com', port=443): Max retries exceeded with url: //login (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "", line 30, in <module>
  File "/Users/rafatsiddiqui/opt/anaconda3/envs/scientificProject/lib/python3.9/site-packages/requests/sessions.py", line 590, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/Users/rafatsiddiqui/opt/anaconda3/envs/scientificProject/lib/python3.9/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/rafatsiddiqui/opt/anaconda3/envs/scientificProject/lib/python3.9/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/Users/rafatsiddiqui/opt/anaconda3/envs/scientificProject/lib/python3.9/site-packages/requests/adapters.py", line 514, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='online.capitalcube.com', port=443): Max retries exceeded with url: //login (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)')))
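
The exception itself is a TLS verification failure: Python cannot find a local CA bundle to validate the site's certificate chain, which is a common symptom of an Anaconda/macOS install whose certificates are not wired up. A minimal sketch of pointing requests at the CA bundle shipped by the certifi package (whether this actually resolves it depends on the local certificate setup):

import certifi
import requests

s = requests.Session()
# Tell requests exactly which CA bundle to verify server certificates against
s.verify = certifi.where()

# or per request:
# s.post(URL + LOGIN_ROUTE, headers=headers, data=login_payload, verify=certifi.where())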

2. The other website I am trying to scrape is stockedge.com. I have come up with the following code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:93.0) Gecko/20100101 Firefox/93.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Connection': 'keep-alive',
}

ticker = 'hdfc-bank/5051?'
urls = {}
urls['balancesheet consolidated'] = f"https://web.stockedge.com/share/{ticker}section=balance-sheet"
urls['balancesheet standalone'] = f"https://web.stockedge.com/share/{ticker}section=balance-sheet&statement-type=standalone"
urls['profitloss consolidated'] = f"https://web.stockedge.com/share/{ticker}section=profit-loss&statement-type=consolidated"
urls['profitloss standalone'] = f"https://web.stockedge.com/share/{ticker}section=profit-loss&statement-type=standalone"
urls['cashflow consolidated'] = f"https://web.stockedge.com/share/{ticker}section=cash-flow"
urls['cashflow standalone'] = f"https://web.stockedge.com/share/{ticker}section=cash-flow&statement-type=standalone"
urls['quarterlyresults consolidated'] = f"https://web.stockedge.com/share/{ticker}section=results"
urls['quarterlyresults standalone'] = f"https://web.stockedge.com/share/{ticker}section=results&active-statement-type=Standalone"
urls['shareholding pattern'] = f"https://web.stockedge.com/share/{ticker}section=pattern"
urls['return ratios'] = f"https://web.stockedge.com/share/{ticker}section=ratios&ratio-id=roe"
urls['efficiency ratios'] = f"https://web.stockedge.com/share/{ticker}section=ratios&ratio-id=roe&ratio-category=efficiencyratios"
urls['growth ratios'] = f"https://web.stockedge.com/share/{ticker}section=ratios&ratio-id=roe&ratio-category=growthratios"
urls['solvency ratios'] = f"https://web.stockedge.com/share/{ticker}section=ratios&ratio-id=net_sales_growth&ratio-category=solvencyratios"
urls['cashflow ratios'] = f"https://web.stockedge.com/share/{ticker}section=ratios&ratio-id=net_sales_growth&ratio-category=cashflowratios"
urls['valuation ratios'] = f"https://web.stockedge.com/share/{ticker}section=ratios&ratio-id=net_sales_growth&ratio-category=valuationratios"

xlwriter = pd.ExcelWriter(f'financial statements ({ticker}).xlsx', engine='xlsxwriter')

for key in urls.keys():
    response = requests.get(urls[key], headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    df = pd.read_html(str(soup), attrs={'class': 'background md list-md hydrated'})[0]
    df.to_excel(xlwriter, sheet_name=key, index=False)

xlwriter.save()

The error I am getting is:

runfile('/Users/rafatsiddiqui/Downloads/scientificProject/Company Financial Webscrape.py', wdir='/Users/rafatsiddiqui/Downloads/scientificProject')
Traceback (most recent call last):
  File "", line 1, in <module>
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_umd.py", line 198, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/rafatsiddiqui/Downloads/scientificProject/Company Financial Webscrape.py", line 36, in <module>
    xlwriter = pd.ExcelWriter(f'financial statements ({ticker}).xlsx', engine='xlsxwriter')
  File "/Users/rafatsiddiqui/opt/anaconda3/envs/scientificProject/lib/python3.9/site-packages/pandas/io/excel/_xlsxwriter.py", line 191, in __init__
    super().__init__(
  File "/Users/rafatsiddiqui/opt/anaconda3/envs/scientificProject/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 925, in __init__
    self.handles = get_handle(
  File "/Users/rafatsiddiqui/opt/anaconda3/envs/scientificProject/lib/python3.9/site-packages/pandas/io/common.py", line 711, in get_handle
    handle = open(handle, ioargs.mode)
FileNotFoundError: [Errno 2] No such file or directory: 'financial statements (hdfc-bank/5051?).xlsx'
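
The FileNotFoundError comes from the '/' and '?' inside `ticker` being passed straight into the workbook filename (the comments below point this out). A minimal sketch of stripping characters that are not legal in a filename before creating the ExcelWriter; the `safe_ticker` name is just illustrative:

import re
import pandas as pd

ticker = 'hdfc-bank/5051?'

# Replace characters that are not allowed in filenames, then trim trailing dashes
safe_ticker = re.sub(r'[\\/:*?"<>|]', '-', ticker).rstrip('-')

xlwriter = pd.ExcelWriter(f'financial statements ({safe_ticker}).xlsx', engine='xlsxwriter')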

  • For your second code snippet, the way you're deriving the filename from the `ticker` right now, you're not checking to see if it contains any illegal characters. Your output XLSX filenames should not contain any forward slashes `/` or question marks `?`. `financial statements (hdfc-bank/5051?).xlsx` was the filename generated in your case, which is invalid. – Paul M. Oct 29 '21 at 15:53
  • In the first one, you are getting an error regarding SSL certificates. Specifically, SSL certificate verification failed. This is because you are not providing the certificate you need to verify against while trying to access an HTTPS-enabled site. Take a look at this: https://stackoverflow.com/questions/46568969/python-requests-ssl-error-certificate-verify-failed – Shubham Vasaikar Oct 29 '21 at 15:56
  • I actually checked the HTML code and there are no CSRF tokens. In fact, on checking the requests under the Network tab, I am unable to see a POST request for the login. Therefore I took the cURL of the GET request and converted it to code on curl.trillworks.com, but this does not seem to work. Can you help with this? – Rafat Siddiqui Oct 29 '21 at 16:06
  • @PaulM. I got your tip and changed the code to this: ``` ticker1 = 'hdfc-bank' ticker2 = '/5051?' ticker = f'{ticker1}{ticker2}' xlwriter = pd.ExcelWriter(f'financial statements ({ticker1}).xlsx', engine='xlsxwriter') ``` The error I am facing now is `raise ValueError("No tables found") ValueError: No tables found`. The page I am trying to extract is https://web.stockedge.com/share/hdfc-bank/5051?section=profit-loss&statement-type=consolidated. I searched for a tag named table but could not find it, therefore I used the class attribute. Can you help with this? – Rafat Siddiqui Oct 29 '21 at 16:11
  • @RafatSiddiqui The page you are trying to scrape from is using the Angular framework/JavaScript to populate the DOM asynchronously. Making a simple HTTP GET request to that page will only return the empty HTML template to you, so BeautifulSoup/Pandas won't see the table. The information you're trying to scrape is being retrieved from a REST API via an XHR HTTP GET request initiated by JavaScript on the browser (again, asynchronously). You need to imitate that XHR HTTP GET request (you don't need BeautifulSoup or Pandas for this). – Paul M. Oct 29 '21 at 17:01
  • @RafatSiddiqui Look up tutorials on using your browser's developer tools (Google Chrome's Devtools, for example). Learn how to log your network traffic to discover the XHR HTTP GET request. Imitate that request (copy API endpoint URL, query-string parameters, request headers) to get a JSON response from the REST API (a rough sketch of such a request follows these comments). Take a look at [this answer](https://stackoverflow.com/questions/65585597/how-to-click-a-link-by-text-with-no-text-in-python/65585861#65585861) I posted on another question, where I go more in-depth. – Paul M. Oct 29 '21 at 17:03
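
Following the approach described in the comments above, a rough sketch of imitating such an XHR request is below. The endpoint URL and query parameters here are placeholders only; the real ones have to be copied from the request that shows up in the browser's Network tab, and they are not verified against StockEdge's actual API:

import requests

# Placeholder endpoint -- replace with the XHR URL captured in the Network tab
API_URL = 'https://web.stockedge.com/api/...'
params = {'section': 'profit-loss', 'statement-type': 'consolidated'}  # illustrative only

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:93.0) Gecko/20100101 Firefox/93.0',
    'Accept': 'application/json',
}

response = requests.get(API_URL, params=params, headers=headers)
response.raise_for_status()
data = response.json()  # the REST API returns JSON, so no BeautifulSoup or read_html is needed
print(data)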

0 Answers