
I am trying to request the batch log-level data from the AppNexus API. Based on the official Data Service guide, there are four main steps:

1. Account authentication -> returns a token in JSON

2. GET the list of available data feeds and look up the download parameters -> returns parameters in JSON

3. GET the file download location code by passing the download parameters -> extract the location code from the response header

4. GET the log data file by passing the location code -> returns a .gz data file

Those steps work perfectly in Terminal using curl:

curl -b cookies -c cookies -X POST -d @auth 'https://api.appnexus.com/auth'
curl -b cookies -c cookies 'https://api.appnexus.com/siphon?siphon_name=standard_feed'
curl --verbose -b cookies -c cookies 'https://api.appnexus.com/siphon-download?siphon_name=standard_feed&hour=2017_12_28_09&timestamp=20171228111358&member_id=311&split_part=0'
curl -b cookies -c cookies 'http://data-api-gslb.adnxs.net/siphon-download/[location code]' > ./data_download/log_level_feed.gz

In Python, I tried the same thing to test the API. However, it keeps giving me a "ConnectionError". Steps 1-2 work fine: I successfully got the parameters from the JSON response and used them to build the URL for step 3, in which I need to request the location code and extract it from the response's header.

Step1:

# Step 1
############ Authentication ###########################
import json
import requests

# Select end-point
auth_endpoint = 'https://api.appnexus.com/auth'

# API credentials
auth_app = json.dumps({'auth': {'username': 'xxxxxxx', 'password': 'xxxxxxx'}})

# Proxy
proxy = {'https': 'https://proxy.xxxxxx.net:xxxxx'}
r = requests.post(auth_endpoint, proxies=proxy, data=auth_app)
data = json.loads(r.text)
token = data['response']['token']

Step2:

# Step 2
########### Check report list ###################################
check_list_endpoint = 'https://api.appnexus.com/siphon?siphon_name=standard_feed'
report_list = requests.get(check_list_endpoint, proxies=proxy, headers={"Authorization":token})
data = json.loads(report_list.text)
print(str(len(data['response']['siphons'])) + ' previous hours available for download')

# Build url for single report - extract para
download_endpoint = 'https://api.appnexus.com/siphon-download'
siphon_name = 'siphon_name=standard_feed' 
hour = 'hour=' + data['response']['siphons'][400]['hour']
timestamp = 'timestamp=' + data['response']['siphons'][400]['timestamp'] 
member_id = 'member_id=311' 
split_part = 'split_part=' + data['response']['siphons'][400]['splits'][0]['part']

# Build url
download_endpoint_url = download_endpoint + '?' + \
siphon_name + '&' + \
hour + '&' + \
timestamp + '&' + \
member_id + '&' + \
split_part
# Check
print(download_endpoint_url)
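As an aside, the query string can also be built from a parameter dict instead of manual concatenation; a sketch using only the standard library, with hypothetical values standing in for the fields pulled from the siphons JSON:

```python
from urllib.parse import urlencode

# Hypothetical values in place of data['response']['siphons'][400][...]
params = {
    'siphon_name': 'standard_feed',
    'hour': '2017_12_28_09',
    'timestamp': '20171228111358',
    'member_id': '311',
    'split_part': '0',
}
# urlencode joins and percent-encodes the key=value pairs
download_endpoint_url = 'https://api.appnexus.com/siphon-download?' + urlencode(params)
print(download_endpoint_url)
```

With requests itself the same dict can simply be passed as `params=params` to `requests.get`, and the query string is built for you.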

Yet, instead of running to completion, the "requests.get" in the following step 3 keeps raising a "ConnectionError". In addition, I found that the "location code" actually appears in the error message, right after "/siphon-download/". So I used "try..except" to extract it from the error message and keep the code running.

Step3:

# Step 3
######### Extract location code for target report ####################
import re

try:
    TT = requests.get(download_endpoint_url, proxies=proxy, headers={"Authorization": token}, timeout=1)
except requests.exceptions.ConnectionError as e:
    text = e.args[0].args[0]
    m = re.search('/siphon-download/(.+?) ', text)
    if m:
        location = m.group(1)
print('Successfully extracted location: ' + location)

Original error message without "try..except" in step 3:

ConnectionError: HTTPConnectionPool(host='data-api-gslb.adnxs.net', port=80): Max retries exceeded with url: 
/siphon-download/dbvjhadfaslkdfa346583 
(Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x0000000007CBC7B8>: 
Failed to establish a new connection: [Errno 10060] A connection attempt failed because the connected party did not 
properly respond after a period of time, or established connection failed because connected host has failed to respond',))

Then, I tried to make the last GET request with the location code extracted from the previous error message, to download the gz data file as I did with "curl" in the terminal. However, I got the same error message - ConnectionError.

Step4:

# Step 4
######## Download data file #######################
extraction_location = 'http://data-api-gslb.adnxs.net/siphon-download/' + location
LLD = requests.get(extraction_location, proxies=proxy, headers={"Authorization":token}, timeout=1)

Original error message in step 4:

ConnectionError: HTTPConnectionPool(host='data-api-gslb.adnxs.net', port=80): Max retries exceeded with url: 
/siphon-download/dbvjhadfaslkdfa346583 
(Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x0000000007BE15C0>: 
Failed to establish a new connection: [Errno 10060] A connection attempt failed because the connected party did not 
properly respond after a period of time, or established connection failed because connected host has failed to respond',))

To double-check, I tested all the endpoints, parameters, and location codes generated by my Python script in the terminal using curl. They all work fine and the downloaded data is correct. Can anybody help me solve this issue in Python, or point me in the right direction to discover why this is happening? Many thanks!

Mark Li
  • Why don't you use proxy with curl? – Oleg Kuralenko Jan 02 '18 at 20:12
  • @ffeast Sry for the confusion. Yes, I also used proxy in curl as well. – Mark Li Jan 03 '18 at 16:24
  • what if you set timeout=100 instead of timeout=1 or enforce 1 second timeout for curl? – Oleg Kuralenko Jan 03 '18 at 18:04
  • Because the server does not give you information about the file size. Non-block download requires additional packet communication, some of the Python modules do not support it. Additionally, some information that is required on the server side may be missing. It is very different to never get data and get a certain part of it. If you put the same file on your local server and try to download it, it will guide you. – dsgdfg Jan 03 '18 at 19:39
  • Why in Step 3 is it showing a `ConnectionError` from a host that was only introduced in Step 4? – Hetzroni Jan 04 '18 at 23:19

1 Answer


1) In curl you are reading and writing cookies (-b cookies -c cookies). With requests you are not using session objects (http://docs.python-requests.org/en/master/user/advanced/#session-objects), so your cookie data is lost.

2) You define an https proxy, and then you try to connect over http with no proxy (to data-api-gslb.adnxs.net). Define both http and https proxies, but only once, on the session object. See http://docs.python-requests.org/en/master/user/advanced/#proxies. (This is probably the root cause of the error message you see.)

3) Requests handles redirects automatically, so there is no need to extract the location header and use it in the next request; it will be followed automatically. So there are 3 steps, not 4, once the other errors are fixed. (This also answers Hetzroni's question in the comments above.)

So use

s = requests.Session()
s.proxies = {
    'http': 'http://proxy.xxxxxx.net:xxxxx',
    'https': 'https://proxy.xxxxxx.net:xxxxx'
}  # set this only once, using valid proxy URLs

then use s.get() and s.post() instead of requests.get() and requests.post().
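Putting the three fixes together, the whole flow might look like the sketch below. The endpoints are the ones from the question; the proxy URL, credentials, and the helper names `make_session` and `download_feed` are placeholders of mine, not AppNexus API names:

```python
import json
import requests

def make_session(proxy_url):
    """One session keeps cookies across requests (fix 1) and applies the
    proxy to both http and https traffic (fix 2)."""
    s = requests.Session()
    s.proxies = {'http': proxy_url, 'https': proxy_url}
    return s

def download_feed(s, username, password, out_path='log_level_feed.gz'):
    """Sketch of the full flow; redirects are followed automatically (fix 3)."""
    # Step 1: authenticate (the session stores the cookies)
    auth = json.dumps({'auth': {'username': username, 'password': password}})
    token = s.post('https://api.appnexus.com/auth', data=auth).json()['response']['token']

    # Step 2: list the available feeds and pick one entry's download parameters
    feeds = s.get('https://api.appnexus.com/siphon',
                  params={'siphon_name': 'standard_feed'},
                  headers={'Authorization': token}).json()
    siphon = feeds['response']['siphons'][-1]

    # Step 3: request the file; requests follows the redirect to
    # data-api-gslb.adnxs.net on its own, so there is no separate step 4
    lld = s.get('https://api.appnexus.com/siphon-download',
                params={'siphon_name': 'standard_feed',
                        'hour': siphon['hour'],
                        'timestamp': siphon['timestamp'],
                        'member_id': '311',
                        'split_part': siphon['splits'][0]['part']},
                headers={'Authorization': token})
    with open(out_path, 'wb') as f:
        f.write(lld.content)

# Usage (placeholders):
# download_feed(make_session('https://proxy.xxxxxx.net:xxxxx'), 'user', 'pass')
```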
Dan-Dev
  • That's right! It works! I see. In the process, it calls 'http://' as well as 'https://'. I need to set both as proxy. Very well explained! Thank you! – Mark Li Jan 08 '18 at 16:02
  • One more follow-up question - how can I download the zipped data file from the request 3 result? "result.text"? – Mark Li Jan 08 '18 at 17:09
  • I think what you're asking is addressed here https://stackoverflow.com/questions/9419162/python-download-returned-zip-file-from-url use the most voted for answer. – Dan-Dev Jan 08 '18 at 17:23
  • Thanks for the quick answer! Unfortunately, when I copied the code, it returns "File is not a zip file". The "StringIO.StringIO(r.content)" is an instance. – Mark Li Jan 08 '18 at 17:44
  • Without access to the file myself it is hard to say what it is. If you download it manually what does it have for a file extension? – Dan-Dev Jan 08 '18 at 17:54
  • I used ".gz" extension. This is the curl command I used to download the file - "curl --proxy 'https://proxy.xxxxx.net:xxxxx' -b cookies -c cookies 'http://data-api-gslb.adnxs.net/siphon-download/sdfasdfasdfgbsxfgh' > ./Desktop/result.gz". – Mark Li Jan 08 '18 at 18:00
  • OK my mistake I thought it was a zip file for some reason. Requests automatically decompresses gzip-encoded responses ... You can get direct access to the raw response (and even the socket), if needed as well. See https://stackoverflow.com/questions/13137817/how-to-download-image-using-requests/13137873 for an example – Dan-Dev Jan 08 '18 at 18:08
  • Thank you! I successfully downloaded the file. Really appreciate the detail answers! :))) – Mark Li Jan 08 '18 at 18:25
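For reference, the save-to-disk step discussed in the comments above can be sketched as follows. Requesting with `stream=True` and copying from `response.raw` writes the body without requests transparently decompressing it, so the .gz file lands on disk byte-for-byte as the server sent it; `save_raw` is a hypothetical helper name:

```python
import shutil

def save_raw(response, path):
    # Copy the undecoded body straight to disk; response.raw is the
    # urllib3 file-like object behind a streamed requests response.
    with open(path, 'wb') as f:
        shutil.copyfileobj(response.raw, f)

# Usage (placeholder URL):
# r = s.get('http://data-api-gslb.adnxs.net/siphon-download/...', stream=True)
# save_raw(r, 'log_level_feed.gz')
```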