
How can I scrape the table in this link using requests? I am trying to use requests, but since the table is inside an iframe, the HTML comes back incomplete. I just need the HTML with the table; once I have it, I think I can handle it with BeautifulSoup. Below is the code I am using:

import requests

url = 'https://www.rad.cvm.gov.br/ENETCONSULTA/frmGerenciaPaginaFRE.aspx?NumeroSequencialDocumento=89180&CodigoTipoInstituicao=2'
resp = requests.get(url, verify=False)
ManuS

2 Answers


If you don't want to use selenium, you can use this script to load the table with requests:

import re
import requests
from bs4 import BeautifulSoup

base_url = 'http://www.rad.cvm.gov.br/ENETCONSULTA/frmGerenciaPaginaFRE.aspx?NumeroSequencialDocumento=89180&CodigoTipoInstituicao=2'

# https://stackoverflow.com/questions/38015537/python-requests-exceptions-sslerror-dh-key-too-small
requests.packages.urllib3.disable_warnings()
requests.packages.urllib3.util.ssl_.DEFAULT_CIPHERS += ':HIGH:!DH:!aNULL'
try:
    requests.packages.urllib3.contrib.pyopenssl.util.ssl_.DEFAULT_CIPHERS += ':HIGH:!DH:!aNULL'
except AttributeError:
    # no pyopenssl support used / needed / available
    pass

with requests.session() as s:
    html_data = s.get(base_url, verify=False).text
    url = 'http://www.rad.cvm.gov.br/ENETCONSULTA/' + re.search(r"window\.frames\[0\]\.location='(.*?)'", html_data).group(1)
    soup = BeautifulSoup(s.get(url, verify=False).content, 'html.parser')

    print(soup.table.prettify())

Prints:

<table id="ctl00_cphPopUp_tbDados">
 <tr>
  <td style="padding:8px 5px 8px 5px; background:#cccfd1; border-bottom:1px solid #fff !important; text-align:center; color:#ffffff; font:normal normal bold 12px 'Trebuchet MS', sans-serif;">
   Conta
  </td>
  <td style="padding:8px 5px 8px 5px; background:#cccfd1; border-bottom:1px solid #fff !important; text-align:center; color:#ffffff; font:normal normal bold 12px 'Trebuchet MS', sans-serif;">
   Descrição
  </td>
  <td style="padding:8px 5px 8px 5px; background:#cccfd1; border-bottom:1px solid #fff !important; text-align:center; color:#ffffff; font:normal normal bold 12px 'Trebuchet MS', sans-serif;">
   01/07/2019
   <br/>
   a
   <br/>
   30/09/2019
  </td>

... and so on.
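Once you have the `soup` object from the script above, the rows can be flattened into plain Python lists. A minimal sketch, demonstrated here on a small stand-in for the real markup (same table `id`, simplified cells) so it runs without hitting the site:

```python
from bs4 import BeautifulSoup

# Stand-in for the real page; in practice use the `soup` from the script above.
sample_html = """
<table id="ctl00_cphPopUp_tbDados">
 <tr><td>Conta</td><td>Descrição</td><td>01/07/2019<br/>a<br/>30/09/2019</td></tr>
 <tr><td>1</td><td>Receita</td><td>1.000</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
rows = []
for tr in soup.select('#ctl00_cphPopUp_tbDados tr'):
    # get_text with a space separator keeps the <br/>-split date range readable
    rows.append([td.get_text(' ', strip=True) for td in tr.find_all('td')])

print(rows)
```

The first row gives you the headers (`Conta`, `Descrição`, the period), and the rest are data rows.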
Andrej Kesely
  • Was the use of `requests.packages.urllib3.util.ssl_.DEFAULT_CIPHERS` and `verify=False` due to your being behind a firewall? – QHarr May 03 '20 at 08:04
  • @QHarr Apparently, this server `http://www.rad.cvm.gov.br/` uses a weak, insecure cipher, so without it `requests` fails to connect. I found this recipe here on SO to bypass it. – Andrej Kesely May 03 '20 at 09:04
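Note that newer urllib3 releases dropped the module-level `DEFAULT_CIPHERS` attribute, so the patching above may raise `AttributeError`. A more contained variant is a per-host transport adapter that lowers OpenSSL's security level only for this site; this is a sketch, and the `WeakDHAdapter` name is mine:

```python
import ssl
import requests
from requests.adapters import HTTPAdapter

class WeakDHAdapter(HTTPAdapter):
    """Accept this server's weak DH key by lowering OpenSSL's security
    level -- mount it only for the one host that needs it."""
    def init_poolmanager(self, *args, **kwargs):
        ctx = ssl.create_default_context()
        # @SECLEVEL=1 re-enables the weak ciphers the DEFAULT_CIPHERS patch allowed
        ctx.set_ciphers('DEFAULT:@SECLEVEL=1')
        kwargs['ssl_context'] = ctx
        return super().init_poolmanager(*args, **kwargs)

s = requests.Session()
s.mount('https://www.rad.cvm.gov.br', WeakDHAdapter())
# requests through `s` to that host now use the relaxed context
```

Any other host reached through the same session still gets the default, stricter adapter.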

The best way to achieve this is to use Selenium instead: wait a few seconds until the iframe loads, then capture the content of the iframe.

Here's an example of how to do this:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from time import sleep

url = 'https://www.rad.cvm.gov.br/ENETCONSULTA/frmGerenciaPaginaFRE.aspx?NumeroSequencialDocumento=89180&CodigoTipoInstituicao=2'
options = Options()
# activate the following two lines to run in headless mode.
# options.add_argument('--headless')
# options.add_argument('--disable-gpu')
options.add_argument("user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36")
# /usr/bin/chromedriver is the path where I've installed chromedriver.
driver = webdriver.Chrome('/usr/bin/chromedriver', options=options)
driver.get(url)
# Wait till iframe loads
sleep(5)
html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML").encode('utf-8').strip()
# Now that you have the fully-loaded HTML, you can continue with getElementsByTagName or a different library like bs4 to extract the content of the iframe.
driver.close()
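Keep in mind that an iframe's content is a separate document, so the outer page's `innerHTML` only contains the `<iframe>` element itself. One way forward is to pull the frame's URL out of the captured HTML and load it directly; a minimal sketch, shown on a placeholder snippet (the `src` value here is made up, not the one the site actually serves):

```python
from bs4 import BeautifulSoup

# Placeholder for the `html` variable captured by the Selenium script above.
html = b'<html><body><iframe src="inner_page.aspx?doc=89180"></iframe></body></html>'

soup = BeautifulSoup(html, 'html.parser')
frame_src = soup.iframe['src']
print(frame_src)  # relative URL of the frame's own document
```

With that URL in hand you can `driver.get(...)` it (or fetch it with requests); alternatively, `driver.switch_to.frame(0)` switches the driver into the frame so you can read the table from inside it.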
Samy