
I would like to automate the extraction of data from this site:

http://www.snirh.gov.br/hidroweb/publico/medicoes_historicas_abas.jsf

Explanation of the steps to follow to extract the data I want:

  1. Starting at the URL above, click on "Séries Históricas". You should see a page with a form with several inputs.
  2. In my case I only need to fill in the "Código da Estação" (station code) input. Suppose the station code is 938001: type it in and hit "Consultar".
  3. Now you should see a lot of checkboxes. Check the one below "Selecionar"; it checks all the other checkboxes.
  4. Suppose I do not want every kind of data, only rain rate and flow rate: then I check only the checkbox below "Chuva" (rain) and the one below "Vazão" (flow).
  5. Next, choose the type of file to download: pick "Arquivo Texto (.TXT)", the .txt format.
  6. Generate the file by clicking "Gerar Arquivo".
  7. Finally, download the file by clicking "Baixar Arquivo".

Note: the site is currently at version v1.0.0.12; it may be different in the future.

I have a list of station codes. Imagine how tedious it would be to do these operations by hand more than 1000 times! I want to automate this.

Many people in Brazil have been trying to automate the extraction of data from this website. Some attempts that I found:

Really old one: https://www.youtube.com/watch?v=IWCrC0MlasQ

Others: https://pt.stackoverflow.com/questions/60124/gerar-e-baixar-links-programaticamente/86150#86150

https://pt.stackoverflow.com/questions/282111/r-download-de-dados-do-portal-hidroweb

An earlier attempt that I found, which also does not work because the site has changed: https://github.com/duartejr/pyHidroWeb

So a lot of people need this, and none of the above solutions work anymore because of updates to the site.

I do not want to use Selenium: it is slow compared with a solution that uses the requests library, and it needs a browser interface.

My attempt:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import requests
from urllib import parse


URL = 'http://www.snirh.gov.br/hidroweb/publico/apresentacao.jsf'

s = requests.Session()

r = s.get(URL)

JSESSIONID = s.cookies['JSESSIONID']

soup = BeautifulSoup(r.content, "html.parser")

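# JSF keeps the page state in a hidden javax.faces.ViewState input;
# its value has to be sent back with every subsequent POST.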
javax_faces_ViewState = soup.find("input", {"type": "hidden", "name":"javax.faces.ViewState"})['value']


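# Form fields captured in the DevTools Network tab when clicking "Séries Históricas".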
d = {}
d['menuLateral:menuForm'] = 'menuLateral:menuForm'
d['javax.faces.ViewState'] = javax_faces_ViewState
d['menuLateral:menuForm:menuSection:j_idt68:link'] = 'menuLateral:menuForm:menuSection:j_idt68:link'

h = {}
h['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'
h['Accept-Encoding'] = 'gzip, deflate'
h['Accept-Language'] = 'pt-BR,pt;q=0.9,en-US;q=0.8,en;q=0.7'
h['Cache-Control'] = 'max-age=0'
h['Connection'] = 'keep-alive'
h['Content-Length'] = '218'
h['Content-Type'] = 'application/x-www-form-urlencoded'
h['Cookie'] = '_ga=GA1.3.4824711.1520011013; JSESSIONID={}; _gid=GA1.3.743342153.1522450617'.format(JSESSIONID)
h['Host'] = 'www.snirh.gov.br'
h['Origin'] = 'http://www.snirh.gov.br'
h['Referer'] = 'http://www.snirh.gov.br/hidroweb/publico/apresentacao.jsf'
h['Upgrade-Insecure-Requests'] = '1'
h['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'

URL2 = 'http://www.snirh.gov.br/hidroweb/publico/medicoes_historicas_abas.jsf'
post_response = s.post(URL2, headers=h, data=d)


soup = BeautifulSoup(post_response.text, "html.parser")
javax_faces_ViewState = soup.find("input", {"type": "hidden", "name":"javax.faces.ViewState"})['value']


def f_headers(JSESSIONID):
    headers = {}
    headers['Accept'] = '*/*'
    headers['Accept-Encoding'] = 'gzip, deflate'
    headers['Accept-Language'] = 'pt-BR,pt;q=0.9,en-US;q=0.8,en;q=0.7'
    headers['Connection'] = 'keep-alive'
    headers['Content-Length'] = '672'
    headers['Content-type'] = 'application/x-www-form-urlencoded;charset=UTF-8'
    headers['Cookie'] = '_ga=GA1.3.4824711.1520011013; JSESSIONID=' + str(JSESSIONID)
    headers['Faces-Request'] = 'partial/ajax'
    headers['Host'] = 'www.snirh.gov.br'
    headers['Origin'] = 'http://www.snirh.gov.br'
    headers['Referer'] = 'http://www.snirh.gov.br/hidroweb/publico/medicoes_historicas_abas.jsf'
    headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'

    return headers


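# Form fields captured when filling "Código da Estação" with 938001 and
# hitting "Consultar" (a JSF partial/ajax request).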
def build_data(data, n, javax_faces_ViewState):

    if n == 1:
        data['form'] = 'form'
        data['form:fsListaEstacoes:codigoEstacao'] = '938001'
        data['form:fsListaEstacoes:nomeEstacao'] = ''
        data['form:fsListaEstacoes:j_idt92'] = 'a39c3713-c0f7-4461-b2c8-c2814b3a9af1'
        data['form:fsListaEstacoes:j_idt101'] = 'a39c3713-c0f7-4461-b2c8-c2814b3a9af1'
        data['form:fsListaEstacoes:nomeResponsavel'] = ''
        data['form:fsListaEstacoes:nomeOperador'] = ''
        data['javax.faces.ViewState'] = javax_faces_ViewState
        data['javax.faces.source'] = 'form:fsListaEstacoes:bt'
        data['javax.faces.partial.event'] = 'click'
        data['javax.faces.partial.execute'] = 'form:fsListaEstacoes:bt form:fsListaEstacoes'
        data['javax.faces.partial.render'] = 'form:fsListaEstacoes:pnListaEstacoes'
        data['javax.faces.behavior.event'] = 'action'
        data['javax.faces.partial.ajax'] = 'true'


data = {}
build_data(data, 1, javax_faces_ViewState)

headers = f_headers(JSESSIONID)

post_response = s.post(URL, headers=headers, data=data)

print(post_response.text)

That prints:

<?xml version='1.0' encoding='UTF-8'?>
<partial-response><changes><update id="javax.faces.ViewState"><![CDATA[-1821287848648292010:1675387092887841821]]></update></changes></partial-response>

Explanation of what I tried:

I used the Chrome developer tools: I pressed F12, clicked "Network", and on the website clicked "Séries Históricas" to discover which headers and form fields are sent. I think I did it correctly. Is there another or better way? Some people told me about Postman and Postman Interceptor, but I don't know how to use them or whether they would help.

After that I filled the "Código da Estação" input with the station code 938001 and hit "Consultar" to see which headers and form fields were sent.

Why is the site returning XML? Does this mean that something went wrong?

This XML has a CDATA section.

What does <![CDATA[]]> in XML mean?

I understand the basic idea of CDATA, but how is it used on this site, and how should I use it in the web scrape? I guess it is used to carry the partial update, but that is just a guess. I am lost.

I tried this for the other clicks too, captured more form fields, and the responses were XML as well. I did not put them here because they would make the code bigger, and the XML is big too.
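For what it's worth, my current guess about how to use that CDATA payload is that I should pull the new javax.faces.ViewState out of each AJAX reply and send it with the next request. Below is a rough sketch of what I mean, assuming the partial-response layout printed above (extract_viewstate is just my own name for the helper), though I am not sure this is the right approach:

import xml.etree.ElementTree as ET

def extract_viewstate(partial_response):
    # Hypothetical helper: parse a JSF <partial-response> document and return
    # the new ViewState value. The CDATA wrapper only protects the raw text;
    # ElementTree exposes it as the element's ordinary .text.
    root = ET.fromstring(partial_response)
    for update in root.iter('update'):
        if 'javax.faces.ViewState' in (update.get('id') or ''):
            return update.text
    return None

# e.g. javax_faces_ViewState = extract_viewstate(post_response.content)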

One SO answer that is not completely related to mine is this:
https://stackoverflow.com/a/8625286

That answer explains the steps to upload a file to a JSF-generated form using Java. That is not my case: I want to download a file using Python requests.
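To be clear about what I am aiming for: once I know the final request behind "Baixar Arquivo", I expect the download itself to be the easy part, something like the generic sketch below (DOWNLOAD_URL and the output file name are only placeholders, not the real endpoint, which I still have to discover in the Network tab):

import requests

# Placeholder URL: the real "Baixar Arquivo" endpoint still has to be found
# in the DevTools Network tab.
DOWNLOAD_URL = 'http://www.example.com/arquivo-gerado.zip'

resp = requests.get(DOWNLOAD_URL, stream=True)
resp.raise_for_status()
with open('938001.zip', 'wb') as fh:  # output file name is just an example
    for chunk in resp.iter_content(chunk_size=8192):
        fh.write(chunk)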

General questions:

  1. When is it possible, and when is it not possible, to use requests + bs4 to scrape a website?

  2. What are the steps for this kind of web scraping?

  3. In cases like this site, is it possible to go straight to the data and extract it in one request, or do we have to mimic the step-by-step process we would follow by hand when filling in the form? Based on this answer it looks like the answer is no: https://stackoverflow.com/a/35665529

I have faced many difficulties and doubts. In my opinion there is a gap in the explanations available for this kind of situation. I agree with the SO question "Python urllib2 or requests post method" on the point that most tutorials are useless for a site like the one I am trying to scrape. A question like this one, https://stackoverflow.com/q/43793998/9577149, which is as hard as mine, has no answer.

This is my first post on Stack Overflow, so sorry if I made mistakes. I am not a native English speaker; feel free to correct me.

pedro regis

2 Answers


1) It's always possible to scrape HTML websites using bs4, but getting the response you want requires more than just Beautiful Soup.

2) My approach with bs4 is usually as follows:

import requests
from bs4 import BeautifulSoup

params = {}  # dict of query-string parameters to send with the request

response = requests.request(
    method="GET",
    url='http://yourwebsite.com',
    params=params,
)
soup = BeautifulSoup(response.text, 'html.parser')

3) If you notice, when you fill out the first form (Séries Históricas) and click submit, the page URL (or action URL) does not change. That is because an AJAX request is made to retrieve and update the data on the current page. Since you can't see that request, it's impossible for you to mimic it.

To submit the form I would recommend looking into mechanize (a Python library for filling in and submitting form data):

from mechanize import Browser

b = Browser()
b.open("http://yourwebsite.com")
b.select_form(name="form")   # pick the form by its name attribute
b["bacia"] = ["value"]       # list controls such as a <select> take a list of values
response = b.submit()        # submit the form
DannyMoshe

The URL of the last request is wrong. In the penultimate line of code, s.post(URL, headers=headers, data=data), the parameter should be URL2 instead.

Also, the cookie name is now SESSIONID, not JSESSIONID, but that must be a change made since the question was asked.

You do not need to manage cookies manually like that when using requests.Session(); it keeps track of cookies for you automatically.
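A minimal sketch of the fix, with everything else in the question's attempt left as-is:

# Post the AJAX form back to the page that rendered it, not to the landing page.
post_response = s.post(URL2, headers=headers, data=data)

# The hand-built Cookie header is also unnecessary: the Session object already
# sends back the session cookie it received on the first GET.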

Artur Gaspar