
I am trying to write a program that performs a chemical search on https://echa.europa.eu/ and retrieves the results. The "Search for Chemicals" field is in the middle of the main webpage. I want to get the result URL for each chemical by providing its CAS number (e.g. 67-56-1). However, the URL I get back does not include the CAS number I provided.

https://echa.europa.eu/search-for-chemicals?p_p_id=disssimplesearch_WAR_disssearchportlet&p_p_lifecycle=0&_disssimplesearch_WAR_disssearchportlet_searchOccurred=true&_disssimplesearch_WAR_disssearchportlet_sessionCriteriaId=dissSimpleSearchSessionParam101401584308302720

I tried inserting a different CAS number (71-23-8) into the "p_p_id" field, but it didn't give the expected search result.
https://echa.europa.eu/search-for-chemicals?p_p_id=71-23-8

I also examined the headers of the GET requests in Chrome's developer tools, and they did not include the CAS number either.

Is the website using server-side variables to store the input query? Is there a way, or a tool, to obtain a result URL that includes the CAS number being searched?

Once I figure this out, I'll use Python to fetch the data and save it as an Excel file.

Thanks.

Bertrand Martel
ywbaek
  • The data is sent in a POST request which returns nothing. My guess is the search values are stored in SESSION variables on the server, so you can't access the data or change it. The only way you could scrape this site is using something like Selenium. – Dan-Dev Mar 15 '20 at 22:37

1 Answer


You need to get the JSESSIONID cookie by requesting the main URL once, then send a POST to https://echa.europa.eu/search-for-chemicals. The request also needs some required URL parameters.

Using bash and curl:

query="71-23-8"
# current time in milliseconds, used for the formDate parameter
millis=$(($(date +%s%N)/1000000))
# first request: obtain the JSESSIONID cookie and store it in cookie.txt
curl -s -I -c cookie.txt 'https://echa.europa.eu/search-for-chemicals'
# second request: POST the search form, sending the stored cookie
curl -s -L -b cookie.txt 'https://echa.europa.eu/search-for-chemicals' \
    --data-urlencode "p_p_id=disssimplesearch_WAR_disssearchportlet" \
    --data-urlencode "p_p_lifecycle=1" \
    --data-urlencode "p_p_state=normal" \
    --data-urlencode "p_p_col_id=column-1" \
    --data-urlencode "p_p_col_count=2" \
    --data-urlencode "_disssimplesearch_WAR_disssearchportlet_javax.portlet.action=doSearchAction" \
    --data-urlencode "_disssimplesearch_WAR_disssearchportlet_backURL=https://echa.europa.eu/home?p_p_id=disssimplesearchhomepage_WAR_disssearchportlet&p_p_lifecycle=0&p_p_state=normal&p_p_mode=view&p_p_col_id=column-1&p_p_col_count=2" \
    --data-urlencode "_disssimplesearchhomepage_WAR_disssearchportlet_sessionCriteriaId=" \
    --data "_disssimplesearchhomepage_WAR_disssearchportlet_formDate=$millis" \
    --data "_disssimplesearch_WAR_disssearchportlet_searchOccurred=true" \
    --data "_disssimplesearch_WAR_disssearchportlet_sskeywordKey=$query" \
    --data "_disssimplesearchhomepage_WAR_disssearchportlet_disclaimer=on" \
    --data "_disssimplesearchhomepage_WAR_disssearchportlet_disclaimerCheckbox=on"

Using Python requests and scraping with BeautifulSoup:

import requests
from bs4 import BeautifulSoup
import time

url = 'https://echa.europa.eu/search-for-chemicals'
query = '71-23-8'

s = requests.Session()
s.get(url)  # initial GET to obtain the JSESSIONID session cookie

r = s.post(url, 
    params = {
        "p_p_id": "disssimplesearch_WAR_disssearchportlet",
        "p_p_lifecycle": "1",
        "p_p_state": "normal",
        "p_p_col_id": "column-1",
        "p_p_col_count": "2",
        "_disssimplesearch_WAR_disssearchportlet_javax.portlet.action": "doSearchAction",
        "_disssimplesearch_WAR_disssearchportlet_backURL": "https://echa.europa.eu/home?p_p_id=disssimplesearchhomepage_WAR_disssearchportlet&p_p_lifecycle=0&p_p_state=normal&p_p_mode=view&p_p_col_id=column-1&p_p_col_count=2",
        "_disssimplesearchhomepage_WAR_disssearchportlet_sessionCriteriaId": ""
    },
    data = {
        "_disssimplesearchhomepage_WAR_disssearchportlet_formDate": int(round(time.time() * 1000)),
        "_disssimplesearch_WAR_disssearchportlet_searchOccurred": "true",
        "_disssimplesearch_WAR_disssearchportlet_sskeywordKey": query,
        "_disssimplesearchhomepage_WAR_disssearchportlet_disclaimer": "on",
        "_disssimplesearchhomepage_WAR_disssearchportlet_disclaimerCheckbox": "on"
    }
)
soup = BeautifulSoup(r.text, "html.parser")
table = soup.find("table")

data = [
    (
        t[0].find("a").text.strip(), 
        t[0].find("a")["href"], 
        t[0].find("div", {"class":"substanceRelevance"}).text.strip(),
        t[1].text.strip(),
        t[2].text.strip(),
        t[3].find("a")["href"] if t[3].find("a") else "",
        t[4].find("a")["href"] if t[4].find("a") else "",
    )
    for t in (t.find_all('td') for t in table.find_all("tr"))
    if len(t) > 0 and t[0].find("a") is not None
]
print(data)

Note that I've set the timestamp parameter (the formDate param) in case it's actually checked on the server.
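Since the asker plans to save the results as an Excel file, here is a minimal sketch that writes tuples shaped like the scraper's `data` list to CSV using only the standard library (Excel opens CSV files directly). The column headers are my own guesses, not ECHA's official names; `pandas.DataFrame(data).to_excel(...)` would produce a real .xlsx if openpyxl is installed.

```python
import csv

# hypothetical rows in the same 7-field shape the scraper's `data` list produces
data = [
    ("propan-1-ol",
     "https://echa.europa.eu/substance-information/-/substanceinfo/100.000.599",
     "CAS no.: 71-23-8",
     "200-746-9",
     "71-23-8",
     "",
     ""),
]

# assumed header names -- adjust to match the actual result table columns
headers = ["Name", "Substance URL", "Relevance", "EC number", "CAS number", "Link 1", "Link 2"]

with open("echa_results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(data)
```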

Bertrand Martel