2

I'm using requests to compile a custom URL and one parameter includes a pound sign. Can anyone explain how to pass the parameter without encoding the pound sign?

This returns the correct CSV file:

import io

import pandas as pd
import requests

results_url = 'https://baseballsavant.mlb.com/statcast_search/csv?all=true&hfPT=&hfAB=&hfBBT=&hfPR=&hfZ=&stadium=&hfBBL=&hfNewZones=&hfGT=R%7C&hfC=&hfSea=2019%7C&hfSit=&player_type=batter&hfOuts=&opponent=&pitcher_throws=&batter_stands=&hfSA=&game_date_gt=&game_date_lt=&hfInfield=&team=&position=&hfOutfield=&hfRO=&home_road=&hfFlag=&hfPull=&metric_1=&hfInn=&min_pitches=0&min_results=0&group_by=name&sort_col=pitches&player_event_sort=h_launch_speed&sort_order=desc&min_abs=0&type=#results'
results = requests.get(results_url, timeout=30).content
results_df = pd.read_csv(io.StringIO(results.decode('utf-8')))

This DOES NOT

from requests import get

URL = 'https://baseballsavant.mlb.com/statcast_search/csv?'

def _get_statcast(params):
    # params: dict of query parameters handed to requests
    _get = get(URL, params=params, timeout=30)
    _get.raise_for_status()
    return _get.content

The issue seems to be that when '#results' is passed through requests, anything after the '#' gets ignored, which causes the wrong CSV to be downloaded. If anyone has thoughts on other ways of going about this, I would appreciate it.

EDIT2: Also asked this on the python forum https://python-forum.io/Thread-Handling-pound-sign-within-custom-URL?pid=75946#pid75946

Nick
  • 6
    Why do you _not_ want to encode the pound sign? – DYZ Mar 30 '19 at 20:30
  • 4
    `www.example.com/type=%23results` sounds like the correct URL. – user2357112 Mar 30 '19 at 20:30
  • 3
    The `#` is a special character in a URI and is **not** meant to be sent to the server side. See this: https://stackoverflow.com/questions/317760/how-to-get-url-hash-from-server-side#answer-318581 – freakish Mar 30 '19 at 20:30
  • @DYZ The url is a CSV download. '%23results' downloads a file I don't need and '#results' downloads the file I do need. – Nick Mar 30 '19 at 20:34
  • 1
    @Nick How do you know that the `#` URL works? Did you test it in a browser? Browsers strip the `#` and everything after it before sending a request, so `www.example.com/type=#results` should be equivalent to `www.example.com/type=`. Have you tried downloading **without** the pound suffix? Basically, just remove the pound suffix when you read the CSV and you should be OK. – freakish Mar 30 '19 at 20:35
  • `#` is a reserved character as defined by RFC-3986. Whether or not you really need to *encode* it depends on what protocol you are using. However, whoever receives it should absolutely be *decoding* it. – chepner Mar 30 '19 at 20:40
  • @freakish Yes, I have tested it via browser. Simply removing # changes the CSV from the one I want to the one I don't want. – Nick Mar 30 '19 at 20:45
  • can you share the actual url? – Liam Apr 02 '19 at 18:29
  • 1
    You should provide a minimal, working and verifiable example of what you are trying to do: which files exist? Where? What transformations do you perform between receiving the HTTP request and sending the HTTP response? As people said, # is special and you should work on encoding/decoding in order to return the correct data. %23 corresponds to #, and % is %25 (see https://www.w3schools.com/tags/ref_urlencode.asp), so you can differentiate the two. – LoneWanderer Apr 02 '19 at 20:00

3 Answers

9

Basically, anything after a literal pound-sign in the URL is not sent to the server. This applies to browsers and requests.
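
As a rough illustration of that split, the standard library's URL parser can show which part of a URL is the query and which part is the fragment. This is only a sketch using `urllib.parse`; it does not send a request, and the httpbin.org URL is used purely as a sample string.

```python
from urllib.parse import urlsplit

# Sample URL with a literal pound sign after "type="
parts = urlsplit('https://httpbin.org/anything?type=#results')

# The query is what a client would send to the server
print(parts.query)     # type=

# The fragment stays on the client side
print(parts.fragment)  # results
```

Everything after the first literal `#` ends up in `fragment`, which is why the server only sees an empty `type` value.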

The format of your URL suggests that the type=#results part is actually a query parameter.

requests will automatically encode the query parameters, while the browser won't. Below are various queries and what the server receives in each case:


URL parameter in the browser

When using the pound-sign in the browser, anything after the pound-sign is not sent to the server:

https://httpbin.org/anything/type=#results

Returns:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Accept-Language": "en-GB,en;q=0.9,en-US;q=0.8,de;q=0.7", 
    "Cache-Control": "max-age=0", 
    "Host": "httpbin.org", 
    "Upgrade-Insecure-Requests": "1", 
    "User-Agent": "*redacted*"
  }, 
  "json": null, 
  "method": "GET", 
  "origin": "*redacted*", 
  "url": "https://httpbin.org/anything/type="
}
  • The URL received by the server is https://httpbin.org/anything/type=.
  • The page being requested is called type= which does not seem to be correct.

Query parameter in the browser

The <key>=<value> format suggests that what you are passing might be a query parameter. Still, anything after the pound-sign is not sent to the server:

https://httpbin.org/anything?type=#results

Returns:

{
  "args": {
    "type": ""
  }, 
  "data": "", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Accept-Language": "en-GB,en;q=0.9,en-US;q=0.8,de;q=0.7", 
    "Host": "httpbin.org", 
    "Upgrade-Insecure-Requests": "1", 
    "User-Agent": "*redacted*"
  }, 
  "json": null, 
  "method": "GET", 
  "origin": "*redacted*", 
  "url": "https://httpbin.org/anything?type="
}
  • The URL received by the server is https://httpbin.org/anything?type=.
  • The page being requested is called anything.
  • An argument type without a value is received.

Encoded query parameter in the browser

https://httpbin.org/anything?type=%23results

Returns:

{
  "args": {
    "type": "#results"
  }, 
  "data": "", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Accept-Language": "en-GB,en;q=0.9,en-US;q=0.8,de;q=0.7", 
    "Host": "httpbin.org", 
    "Upgrade-Insecure-Requests": "1", 
    "User-Agent": "*redacted*"
  }, 
  "json": null, 
  "method": "GET", 
  "origin": "*redacted*", 
  "url": "https://httpbin.org/anything?type=%23results"
}
  • The URL received by the server is https://httpbin.org/anything?type=%23results.
  • The page being requested is called anything.
  • An argument type with a value of #results is received.

Python requests with URL parameter

requests will also not send anything after the pound-sign to the server:

import requests

r = requests.get('https://httpbin.org/anything/type=#results')
print(r.url)
print(r.json())

Returns:

https://httpbin.org/anything/type=#results
{
    "args": {},
    "data": "",
    "files": {},
    "form": {},
    "headers": {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate",
        "Host": "httpbin.org",
        "User-Agent": "python-requests/2.21.0"
    },
    "json": null,
    "method": "GET",
    "origin": "*redacted*",
    "url": "https://httpbin.org/anything/type="
}
  • The URL received by the server is https://httpbin.org/anything/type=.
  • The page being requested is called type= which does not seem to be correct.

Python requests with query parameter

requests automatically encodes query parameters:

import requests

r = requests.get('https://httpbin.org/anything', params={'type': '#results'})
print(r.url)
print(r.json())

Returns:

https://httpbin.org/anything?type=%23results
{
    "args": {
        "type": "#results"
    },
    "data": "",
    "files": {},
    "form": {},
    "headers": {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate",
        "Host": "httpbin.org",
        "User-Agent": "python-requests/2.21.0"
    },
    "json": null,
    "method": "GET",
    "origin": "*redacted*",
    "url": "https://httpbin.org/anything?type=%23results"
}
  • The URL received by the server is https://httpbin.org/anything?type=%23results.
  • The page being requested is called anything.
  • An argument type with a value of #results is received.
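
The same encoding and decoding round trip can be reproduced offline with the standard library. This is a sketch that assumes form-style encoding comparable to what requests applies to a params dict; it uses `urllib.parse` directly and is not a trace of requests' internals.

```python
from urllib.parse import parse_qs, urlencode

# Encode a params dict the way a form-style query string is built
query = urlencode({'type': '#results'})
print(query)  # type=%23results

# Decode it again, as a server would when reading the query string
decoded = parse_qs(query)
print(decoded)  # {'type': ['#results']}
```

The pound sign survives the trip only because it travels as `%23` and is decoded on the receiving side.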

Python requests with doubly-encoded query parameter

If you manually encode the query parameter and then pass it to requests, it will encode the already encoded query parameter again:

import requests

r = requests.get('https://httpbin.org/anything', params={'type': '%23results'})
print(r.url)
print(r.json())

Returns:

https://httpbin.org/anything?type=%2523results
{
    "args": {
        "type": "%23results"
    },
    "data": "",
    "files": {},
    "form": {},
    "headers": {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate",
        "Host": "httpbin.org",
        "User-Agent": "python-requests/2.21.0"
    },
    "json": null,
    "method": "GET",
    "origin": "*redacted*",
    "url": "https://httpbin.org/anything?type=%2523results"
}
  • The URL received by the server is https://httpbin.org/anything?type=%2523results.
  • The page being requested is called anything.
  • An argument type with a value of %23results is received.
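
The double encoding can be checked without a network call. The sketch below uses `urllib.parse.quote` and `unquote` from the standard library to mimic encoding an already encoded value; the variable names are only illustrative.

```python
from urllib.parse import quote, unquote

# First pass: the pound sign becomes %23
once = quote('#results', safe='')
print(once)   # %23results

# Second pass: the percent sign itself becomes %25
twice = quote(once, safe='')
print(twice)  # %2523results

# A server decodes only once, so it ends up with the literal text %23results
print(unquote(twice))  # %23results
```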
Cloudomation
  • 1
    Thanks for the explanation. Now I realize why the main CSV was getting returned regardless of any changes I made to the params. – Nick Apr 02 '19 at 21:55
0

The answer by Cloudomation provides a lot of interesting information but I think it may not be what you are looking for. Assuming this identical thread in the python forum is written by you as well, read on:

From the information you provided it seems that type=#results is being used to filter the original csv and return only parts of the data.
If this is the case, the type= part is not really necessary (try the URL without it and see that you get the same results).

I'll explain:

The # symbol in URLs is called a fragment identifier, and it serves different purposes in different kinds of pages. In text/csv pages, it serves to filter the CSV table by column, row or some combination of the two. You can read more about it here.

In your case, results could be a query parameter that is used to filter the csv table in a custom way.

Unfortunately, as illustrated in Cloudomation's answer, the fragment is not available on the server side, so you will not be able to pass it via a python requests parameter in the way you tried.

You could try to access it in Javascript as suggested here or simply download the entire (unfiltered) CSV table and filter it yourself.

There are many ways to do this easily and efficiently in python. Look here for more information, or if you need more control you can import the CSV into a pandas dataframe.
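
As a minimal sketch of the "filter it yourself" route, the standard `csv` module is enough for simple cases. The column names and rows below are hypothetical sample data, not the real Statcast export, and the threshold is arbitrary.

```python
import csv
import io

# Hypothetical stand-in for the downloaded, unfiltered CSV text
csv_text = "player_name,pitches\nPlayer A,10\nPlayer B,25\n"

# Parse each row into a dict keyed by the header line
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Keep only the rows that match a condition of your choosing
filtered = [row for row in rows if int(row['pitches']) >= 20]
print(filtered)  # [{'player_name': 'Player B', 'pitches': '25'}]
```

With the real download, `csv_text` would be the decoded response body, and the condition would use whichever columns the file actually contains.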


**EDIT:**

I see you found a workaround by joining the strings and passing a second request. Since this works, you could probably get away with converting the params to string (as suggested here). If it does what you're after this would be a more efficient and perhaps slightly more elegant solution:

import requests

statcast_url = 'https://baseballsavant.mlb.com/statcast_search/csv?'
params = {'key1': 'value1', 'key2': 'value2'}  # sample params dict

def _get_statcast_results(params):
    # convert params to a string - alternatively you can use %-formatting
    params_str = "&".join(f"{k}={v}" for k, v in params.items())

    s = requests.Session()
    data = s.get(statcast_url, params=params_str, timeout=30)

    return data.content

yuvgin
  • Lol, ya that was me on the python forum. I figured sooner or later I would get some sort of feedback between the two. Just to reiterate what I said at the top...I know requests in itself isn't the problem because I can pull the correct data without using params (I just need to use params in order to choose other specific situations, if that makes sense). – Nick Apr 02 '19 at 21:51
  • @Nick in the future when you cross-post on multiple sites, linking to them so that the question doesn't get answered twice is appreciated. – micseydel Apr 02 '19 at 22:00
  • @micseydel I'll keep that in mind. Normally I just stick to asking here but wasn't getting much feedback and really need to get this solved. – Nick Apr 02 '19 at 22:03
0

I've only gotten through one trial but hopefully have a solution. Instead of passing '#results' through params, I started a session with the base URL plus all the other params, joined the resulting URL with '#results', and then ran it through a second get.

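from requests import session

statcast_url = 'https://baseballsavant.mlb.com/statcast_search/csv?'
results_url = '&type=#results&'

def _get_statcast_results(params):
    s = session()
    # First request: let requests build the URL from the base URL and params
    _get = s.get(statcast_url, params=params, timeout=30, allow_redirects=True)

    # Second request: append the literal '#results' suffix to the built URL
    new_url = _get.url + results_url
    data = s.get(new_url, timeout=30)

    return data.content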

Still need to run through some more trials but I think this should work. Thanks to everyone who chimed in. Even though I didn't get a direct answer the responses still helped a ton.
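
To see what the second URL in a workaround like this would carry, it can be split offline with the standard library. This is a sketch only: the `player_type=batter` query is a hypothetical stand-in for whatever URL the first request produced, and no request is sent.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical URL as built by the first request
first_url = 'https://baseballsavant.mlb.com/statcast_search/csv?player_type=batter'

# Same suffix as in the workaround
new_url = first_url + '&type=#results&'
parts = urlsplit(new_url)

print(parts.query)     # player_type=batter&type=
print(parts.fragment)  # results&

# keep_blank_values shows that "type" is present but empty
print(parse_qs(parts.query, keep_blank_values=True))
# {'player_type': ['batter'], 'type': ['']}
```

Under these assumptions the query carries an empty `type` value, while `results&` sits in the fragment, which matches the earlier observation that text after a literal `#` is not part of the query.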

Nick
  • See the edit to my answer. My bet is it will work, and if so it will save you the second request. – yuvgin Apr 03 '19 at 13:12