My goal is to get the PQRI table (the second of the two tables listed) from this webpage using Python.
Since it is an AJAX table, I tried the following:

  • Open the webpage in Chrome.
  • Open developer tools -> Network -> Fetch/XHR to get the request URL, request headers and payload.
  • Use the requests library to make a POST request:
url = "https://apps.usp.org/ajax/USPNF/columnsDB.php"


headers = {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.9",
"Connection": "keep-alive",
"Content-Length": "201",
"Content-Type": "application/x-www-form-urlencoded",
"Cookie": "_fbp=fb.1.1646747716384.2068133566; tc_ptid=3U21FqQ3bklFEULP2jijnQ; tc_ptidexpiry=1709819716801; BE_CLA3=p_id%3D8A64RLL6L464RLNNA48664N2RAAAAAAAAH%26bf%3D8d70551f1d08356108a60fc4a2db91d0%26bn%3D1%26bv%3D3.44%26s_expire%3D1648554934915%26s_id%3D8A64RLL6L464RJ2L8J6664N2RAAAAAAAAH; _gid=GA1.2.1041569168.1648468535; _ga_DTGQ04CR27=GS1.1.1648468535.10.0.1648468535.0; USPSESSID=u6i1i80ot1uk49mnauim3o7l37; _ga=GA1.2.1946138806.1646747717; BIGipServerprod_apps.usp.org_http_pool=1271466250.20480.0000",
"Host": "apps.usp.org",
"Origin": "https://apps.usp.org",
"Referer": "https://apps.usp.org/app/USPNF/columnsDB.html",
"sec-ch-ua": "Not A;Brand ;v=99, Chromium;v=99, Google Chrome;v=99",
"sec-ch-ua-mobile" : "?0",
"sec-ch-ua-platform": "Windows",
"Sec-Fetch-Dest": "empty",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Site": "same-origin",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36",
"X-Powered-By": "CPAINT v2.1.0 :: http://sf.net/projects/cpaint",
}

payload = {
"cpaint_function": "updatePQRIResults",
"cpaint_argument[]": "Acclaim%20120%20C18",
"cpaint_argument[]": 0,
"cpaint_argument[]": 0,
"cpaint_argument[]": 0,
"cpaint_argument[]": 2.8,
"cpaint_argument[]": 0,
"cpaint_response_type": "OBJECT",
}

response = requests.post(url, data=payload, headers=headers)

I see the desired output in the developer tools.

But when I make the request I only get the following response:

"<c_start></c_start><c_total></c_total>getPQRIData: No base column '0'\u003cbr\u003e\u000a"

Any idea what I need to change to get the desired output?

päger
  • The value of `"Content-Length"` depends on the size of the data in `payload`, and requests calculates it automatically, so don't set it manually. – furas Mar 29 '22 at 09:27
  • requests automatically encodes the values in `payload`, but `"Acclaim%20120%20C18"` is already encoded, so it gets encoded a second time, which produces a wrong value. Either send the already-encoded data as a single string or put the unencoded value `Acclaim 120 C18` in `payload`. – furas Mar 29 '22 at 09:29
  • Pages often check cookies as well, so you may need to create a `requests.Session()`, first GET the main page to obtain fresh cookies, and then POST (the session sends the cookies automatically) to get the data. – furas Mar 29 '22 at 09:33
  • `payload` is a dictionary, and a dictionary can hold the key `"cpaint_argument[]"` only once, so it keeps only the last value (`"cpaint_argument[]": 0`) and sends only that one. – furas Mar 29 '22 at 09:38
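
To illustrate the comments about the repeated dictionary key and the double encoding, here is a minimal sketch that uses only the standard library's `urllib.parse`. As far as I know, requests form-encodes `data` in much the same way, but treat that as an assumption; the variable names are only for illustration.

```python
from urllib.parse import quote, urlencode

# A dict literal with a repeated key silently keeps only the last value.
payload = {
    "cpaint_function": "updatePQRIResults",
    "cpaint_argument[]": "Acclaim%20120%20C18",
    "cpaint_argument[]": 0,
    "cpaint_argument[]": 0,
    "cpaint_argument[]": 0,
    "cpaint_argument[]": 2.8,
    "cpaint_argument[]": 0,
    "cpaint_response_type": "OBJECT",
}
collapsed = urlencode(payload)
print(len(payload))   # 3 keys survive, not 8
print(collapsed)      # only a single cpaint_argument[] is left

# Encoding an already-encoded value escapes the '%' again.
double_encoded = quote("Acclaim%20120%20C18")
print(double_encoded)  # Acclaim%2520120%2520C18

# A list of (key, value) pairs keeps every repeated key, and the
# unencoded column name is encoded exactly once.
pairs = [
    ("cpaint_function", "updatePQRIResults"),
    ("cpaint_argument[]", "Acclaim 120 C18"),
    ("cpaint_argument[]", 0),
    ("cpaint_argument[]", 0),
    ("cpaint_argument[]", 0),
    ("cpaint_argument[]", 2.8),
    ("cpaint_argument[]", 0),
    ("cpaint_response_type", "OBJECT"),
]
encoded_pairs = urlencode(pairs)
print(encoded_pairs)
```

Note that `urlencode` also escapes the brackets in the key and turns spaces into `+`; whether this server accepts that form was not tested here.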

1 Answer


You can't send that form data as a dictionary/JSON: a dict can hold the repeated `cpaint_argument[]` key only once. Send the payload as a single string and it should work:

from io import StringIO

import pandas as pd
import requests


# Start a session and load the main page first so the server sets fresh cookies.
s = requests.Session()
s.get('https://apps.usp.org/app/USPNF/columnsDB.html')
cookies = s.cookies.get_dict()

# Rebuild the cookie header by hand (optional: the session already sends its cookies).
cookieStr = ''.join(f'{k}={v};' for k, v in cookies.items())

url = "https://apps.usp.org/ajax/USPNF/columnsDB.php"
# "Content-Length" is left out on purpose: requests computes it from the body.
headers = {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.9",
"Connection": "keep-alive",
"Content-Type": "application/x-www-form-urlencoded",
"Cookie": cookieStr,
"Host": "apps.usp.org",
"Origin": "https://apps.usp.org",
"Referer": "https://apps.usp.org/app/USPNF/columnsDB.html",
"sec-ch-ua": "Not A;Brand ;v=99, Chromium;v=99, Google Chrome;v=99",
"sec-ch-ua-mobile" : "?0",
"sec-ch-ua-platform": "Windows",
"Sec-Fetch-Dest": "empty",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Site": "same-origin",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.141 Safari/537.36",
"X-Powered-By": "CPAINT v2.1.0 :: http://sf.net/projects/cpaint",
}

frames = []
page = 0
while True:
    # The last argument is the row offset; it advances by 10 per page.
    i = page*10
    payload = f'cpaint_function=updatePQRIResults&cpaint_argument[]=Acclaim%20120%20C18&cpaint_argument[]=1&cpaint_argument[]=0&cpaint_argument[]=0&cpaint_argument[]=2.8&cpaint_argument[]={i}&cpaint_response_type=OBJECT'

    response = s.post(url, data=payload, headers=headers).text

    # Parse the XML response; the slice skips the first three rows,
    # the last row and the first three columns.
    df = pd.read_xml(StringIO(response)).iloc[3:-1,3:]

    # A page with a single row whose 'psr' is 0 means we ran past the last page.
    if (df.iloc[0]['psr'] == 0) and (len(df) == 1):
        print('Complete')
        break

    frames.append(df)
    print(f'Page: {page + 1}')
    page += 1

# Combine all pages once at the end and drop any duplicated rows.
final_df = pd.concat(frames, axis=0) if frames else pd.DataFrame()
final_df = final_df.drop_duplicates().reset_index(drop=True)

Output:

print(final_df)
       psr    psf                  psn  ...   psvb psvc28 psvc70
0      0.0   0.00      Acclaim 120 C18  ... -0.027  0.086 -0.002
1      1.0   0.24      TSKgel ODS-100Z  ... -0.031 -0.064 -0.161
2      2.0   0.67       Inertsil ODS-3  ... -0.023 -0.474 -0.334
3      3.0   0.74          LaChrom C18  ... -0.006 -0.278 -0.120
4      4.0   0.80       Prodigy ODS(3)  ... -0.012 -0.195 -0.134
..     ...    ...                  ...  ...    ...    ...    ...
753  753.0  29.55        Cosmosil 5PYE  ...  0.092  0.521  1.318
754  754.0  30.44      BioBasic Phenyl  ...  0.217  0.014  0.390
755  755.0  34.56  Microsorb-MV 100 CN  ... -0.029  0.148  0.785
756  756.0  41.62      Inertsil ODS-EP  ...  0.050 -0.620 -0.070
757  757.0  41.84           Flare C18+  ...  0.966 -0.507  1.178

[758 rows x 12 columns]
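
The payload string used in the loop could also be assembled by a small helper, so that only the column name and the row offset vary. This is just a sketch: `build_payload` is a hypothetical name, and the other argument values (1, 0, 0, 2.8) are copied from the request as-is, since their meaning is not documented here.

```python
from urllib.parse import quote


def build_payload(column_name, offset):
    """Return the form body as one pre-encoded string (hypothetical helper)."""
    # Percent-encode the column name once; the numeric arguments are
    # taken unchanged from the captured request.
    args = [quote(column_name), 1, 0, 0, 2.8, offset]
    parts = ["cpaint_function=updatePQRIResults"]
    parts += [f"cpaint_argument[]={arg}" for arg in args]
    parts.append("cpaint_response_type=OBJECT")
    return "&".join(parts)


# Offsets for the first three pages, stepping by 10 as in the loop.
for page in range(3):
    print(build_payload("Acclaim 120 C18", page * 10))
```

Passing the result as `data=` keeps it a plain string, so it is sent without being encoded a second time.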
chitown88
  • The code works for me even without `"Cookie": cookieStr`; `requests.Session()` should take care of the cookies, so there is no need to copy them from one request to another. – furas Mar 29 '22 at 11:12
  • Works like a charm. Sending the payload as a string did the job. Thank you both! – päger Mar 29 '22 at 11:54
  • @furas, good point. I hadn't even checked. If it works without that, then yes, eliminate it. I once came across a case where `requests.Session()` hadn't worked, and I had to reconstruct the cookie string and add it to the headers. It was likely a one-off, but I got into the habit of always doing it. – chitown88 Mar 29 '22 at 12:03
  • I have never run into this problem, but some pages can surprise you. It would be interesting to see it on a real page. BTW: the page from the question gives me a different problem, an SSL error, `dh key too small`, similar to [ssl - Python - requests.exceptions.SSLError - dh key too small - Stack Overflow](https://stackoverflow.com/questions/38015537/python-requests-exceptions-sslerror-dh-key-too-small) – furas Mar 29 '22 at 12:14