
The following works for 99.999% of websites, but I randomly found one for which it does not:

import requests
requests.get('http://arboleascity.com', timeout=(5, 5), verify=False)

I have filed an issue on the project.

https://github.com/requests/requests/issues/4276

Any suggestions or ideas?

I am running this in a concurrent.futures.ThreadPoolExecutor, so I don't really want to add something external like eventlet or signals, but I'm open to anything that works well.

Glen Thompson
  • Is the page very/infinitely long? – Nick T Sep 05 '17 at 19:12
  • Go to it, I can't see anything wrong with the site. Loads super fast, fairly small size of content. – Glen Thompson Sep 05 '17 at 19:12
  • 3
    Curl it, I can't see it ever end. It looks like the page endlessly streams at 8 kBps – Nick T Sep 05 '17 at 19:14
  • IMHO a 0.0001% failure rate could also come from a network hiccup. Anyhow, since you're successful most of the time, I think it is an issue for GitHub – endo.anaconda Sep 05 '17 at 19:15
  • Nick: you are right. Hmm, any suggestions on how to limit this with the requests library? – Glen Thompson Sep 05 '17 at 19:16
  • 1
    Possible duplicate of [Python requests, how to limit received size, transfer rate, and/or total time?](https://stackoverflow.com/questions/22346158/python-requests-how-to-limit-received-size-transfer-rate-and-or-total-time) – Nick T Sep 05 '17 at 19:18
  • 1
    If you're "randomly finding" websites, you probably need more containment if you have literally no idea what sort of data you're going to pull down. – Nick T Sep 05 '17 at 19:23
  • Nick: I think you are right about the duplicate question, although it seems slightly different to me, in the sense that my question is specific to that site. The answer I am really looking for, though, is the more robust approach presented in the duplicate question. – Glen Thompson Sep 05 '17 at 19:33

3 Answers


It streams a SHOUTcast stream (Content-Type: audio/aacp), so the read timeout never fires: bytes keep arriving, and the stream simply never stops.

If you want the homepage and not the stream, set the User-Agent header to something browser-like. If you want the audio stream, use stream=True and iterate over the content; there you can also bail out whenever you want, as in the sketch below.
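
A minimal sketch of that bail-out (the byte and time caps here are assumptions, not anything the site dictates):

import time
import requests

MAX_BYTES = 1_000_000  # assumed cap on how much we are willing to read
MAX_SECONDS = 10       # assumed wall-clock budget for the whole download

r = requests.get('http://arboleascity.com', stream=True, timeout=(5, 5))
start = time.time()
received = 0
for chunk in r.iter_content(chunk_size=8192):
    received += len(chunk)
    if received > MAX_BYTES or time.time() - start > MAX_SECONDS:
        r.close()  # drop the connection; the stream would otherwise never end
        break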

If you are writing a scraper, you might want to check the Content-Type with a HEAD request before fetching responses that may be chunked or endless.
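
A minimal sketch of that check (assuming the server answers HEAD sensibly, which not every server does):

import requests

head = requests.head('http://arboleascity.com', timeout=(5, 5), allow_redirects=True)
content_type = head.headers.get('Content-Type', '')
if content_type.startswith('text/html'):
    r = requests.get('http://arboleascity.com', timeout=(5, 5))
# otherwise skip it: audio/aacp (or any other streaming type) would never finish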

scandinavian_

Both timeouts are working exactly as documented.

The connect timeout is the number of seconds Requests will wait for your client to establish a connection to a remote machine (corresponding to the connect() call on the socket). It's good practice to set connect timeouts to slightly larger than a multiple of 3, which is the default TCP packet retransmission window.

Once your client has connected to the server and sent the HTTP request, the read timeout is the number of seconds the client will wait for the server to send a response. Specifically, it's the number of seconds that the client will wait between bytes sent from the server; in 99.9% of cases, this is the time before the server sends the first byte. Since this site keeps sending bytes indefinitely, the read timeout never triggers.
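
For illustration, a call that follows that convention (the exact values here are arbitrary):

import requests

# connect timeout just above one 3-second retransmission window,
# read timeout bounding the gap between bytes (illustrative values)
r = requests.get('http://arboleascity.com', timeout=(3.05, 27))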

Community
  • Thanks, good to know. So my question title is incorrectly tied to my problem; this doesn't really solve it, but still an up-vote as what you say is true. – Glen Thompson Sep 05 '17 at 19:25

The problem is not with requests, but with the way you're accessing that specific site.

Namely, it seems that http://arboleascity.com uses the User-Agent header field to differentiate browsers from music players.

If you use a valid browser signature, it just returns the page HTML (text/html) and closes the connection:

$ curl -vvv -A 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0' http://arboleascity.com >/dev/null
...
< Content-Type: text/html;charset=utf-8
...
100   118    0   118    0     0    297      0 --:--:-- --:--:-- --:--:--   297
* Connection #0 to host arboleascity.com left intact

However, if you leave the User-Agent at its default (non-browser) value, the site streams binary content (audio/aacp) at ~8 kbps:

$ curl -vvv http://arboleascity.com >/dev/null
...
< Content-Type: audio/aacp
...
< icy-notice1: <BR>This stream requires <a href="http://www.winamp.com">Winamp</a><BR>
< icy-notice2: SHOUTcast DNAS/posix(linux x64) v2.5.1.724<BR>
...
100  345k    0  345k    0     0  26975      0 --:--:--  0:00:13 --:--:--  7118^C

Or, with requests:

>>> headers = {'user-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0'}
>>> r = requests.get('http://arboleascity.com', headers=headers)
randomir
  • Thanks, you are right, although, similar to antii's answer, this doesn't really help me. Still appreciated though, so upvote. – Glen Thompson Sep 05 '17 at 19:35
  • Any suggestions on how to prevent streaming using requests? – Glen Thompson Sep 05 '17 at 19:41
  • If you wish to handle the streaming case (limit the received content size), see the linked [possible duplicate](https://stackoverflow.com/questions/22346158/python-requests-how-to-limit-received-size-transfer-rate-and-or-total-time). – randomir Sep 05 '17 at 19:42