I'm working with a process which is basically as follows:

  1. Take some list of urls.
  2. Get a Response object from each.
  3. Create a BeautifulSoup object from the text of each Response.
  4. Pull the text of a certain tag from that BeautifulSoup object.
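
For illustration, the four steps above can be sketched without touching the network. This is only a rough sketch: it uses the standard library's `html.parser` as a stand-in for BeautifulSoup, and the names `FirstTagText`, `first_tag_text`, `extract_all`, and the injected `fetch` callable are hypothetical helpers, not part of requests or bs4.

```python
from html.parser import HTMLParser


class FirstTagText(HTMLParser):
    """Collect the text inside the first occurrence of a given tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # > 0 while we are inside the target tag
        self.done = False   # True once the first match has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and not self.done:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth and not self.done:
            self.parts.append(data)


def first_tag_text(html, tag):
    """Steps 3-4: parse the HTML and return the tag's text, or None."""
    parser = FirstTagText(tag)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip() or None


def extract_all(urls, fetch, tag):
    """Steps 1-4: `fetch(url)` must return the page text for one URL."""
    return [first_tag_text(fetch(url), tag) for url in urls]
```

In real use, `fetch` would presumably be something like `lambda u: requests.get(u).text`, and with BeautifulSoup the parsing step would be closer to `BeautifulSoup(text, 'html.parser').find(tag).get_text()`.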

From my understanding, this seems ideal for grequests:

GRequests allows you to use Requests with Gevent to make asynchronous HTTP Requests easily.

Yet the two approaches (one with requests, one with grequests) give me different results: some of the requests sent through grequests come back as None rather than a Response.

Using requests

import requests

tickers = [
    'A', 'AAL', 'AAP', 'AAPL', 'ABBV', 'ABC', 'ABT', 'ACN', 'ADBE', 'ADI', 
    'ADM',  'ADP', 'ADS', 'ADSK', 'AEE', 'AEP', 'AES', 'AET', 'AFL', 'AGN', 
    'AIG', 'AIV', 'AIZ', 'AJG', 'AKAM', 'ALB', 'ALGN', 'ALK', 'ALL', 'ALLE',
    ]

BASE = 'https://finance.google.com/finance?q={}'

rs = (requests.get(u) for u in [BASE.format(t) for t in tickers])
rs = list(rs)

rs
# [<Response [200]>,
 # <Response [200]>,
 # <Response [200]>,
 # <Response [200]>,
 # <Response [200]>,
 # <Response [200]>,
 # ...
 # <Response [200]>]

# All are okay (status_code == 200)

Using grequests

# Restarted my interpreter and redefined `tickers` and `BASE`
import grequests

rs = (grequests.get(u) for u in [BASE.format(t) for t in tickers])
rs = grequests.map(rs)

rs
# [None,
 # <Response [200]>,
 # None,
 # None,
 # None,
 # None,
 # None,
 # None,
 # None,
 # None,
 # None,
 # None,
 # None,
 # None,
 # None,
 # None,
 # None,
 # None,
 # <Response [200]>,
 # <Response [200]>,
 # <Response [200]>,
 # <Response [200]>,
 # <Response [200]>,
 # <Response [200]>,
 # <Response [200]>,
 # <Response [200]>,
 # <Response [200]>,
 # <Response [200]>,
 # <Response [200]>,
 # <Response [200]>]

Why the difference in results?

Update: I can print the exception as follows. There is a related discussion here, but I have no idea what's going on.

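def exception_handler(request, exception):
    print(exception)

# `rs` was consumed by the earlier map() call, so rebuild the requests first
rs = (grequests.get(u) for u in [BASE.format(t) for t in tickers])
rs = grequests.map(rs, exception_handler=exception_handler)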

# ("bad handshake: SysCallError(-1, 'Unexpected EOF')",)
# ("bad handshake: SysCallError(-1, 'Unexpected EOF')",)
# ("bad handshake: SysCallError(-1, 'Unexpected EOF')",)
# ("bad handshake: SysCallError(-1, 'Unexpected EOF')",)
# ("bad handshake: SysCallError(-1, 'Unexpected EOF')",)
# ("bad handshake: SysCallError(-1, 'Unexpected EOF')",)
# ("bad handshake: SysCallError(-1, 'Unexpected EOF')",)
# ("bad handshake: SysCallError(-1, 'Unexpected EOF')",)
# ("bad handshake: SysCallError(-1, 'Unexpected EOF')",)
# ("bad handshake: SysCallError(-1, 'Unexpected EOF')",)
# ("bad handshake: SysCallError(-1, 'Unexpected EOF')",)
# ("bad handshake: SysCallError(-1, 'Unexpected EOF')",)
# ("bad handshake: SysCallError(-1, 'Unexpected EOF')",)
# ("bad handshake: SysCallError(-1, 'Unexpected EOF')",)
# ("bad handshake: SysCallError(-1, 'Unexpected EOF')",)
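
Printing the exception does not show which URL failed. A possible variation is a handler that records each failure together with its URL, so the failed requests can be inspected or retried later. This is a sketch only: `make_failure_collector` is a hypothetical helper, and it assumes the request object passed to the handler exposes a `.url` attribute.

```python
def make_failure_collector():
    """Return (handler, failures).

    `handler` has the (request, exception) signature expected by the
    `exception_handler` argument; `failures` fills up as it is called.
    """
    failures = []

    def handler(request, exception):
        # Assumption: the request object carries its target on `.url`.
        # Fall back to the object itself if that attribute is missing.
        url = getattr(request, "url", request)
        failures.append((url, type(exception).__name__, str(exception)))

    return handler, failures


# Intended use (commented out so the block runs without grequests):
# handler, failures = make_failure_collector()
# rs = grequests.map(reqs, exception_handler=handler)
# for url, kind, message in failures:
#     print(url, kind, message)
```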

System/version info

  • requests: 2.18.4
  • grequests: 0.3.0
  • Python: 3.6.3
  • urllib3: 1.22
  • pyopenssl: 17.2.0
  • All via Anaconda
  • System: same issue on both macOS High Sierra and Windows 10, build 10.0.16299
Brad Solomon
  • If you look at the [README](https://github.com/kennethreitz/grequests), it suggests that failed requests result in a `None`. I'm guessing that Google is getting angry when you make too many unauthenticated requests all at once. A little further down, the README describes how to write an exception handler that would tell you what's going on. – Nick T Sep 13 '17 at 19:35
  • Print the exception, rather than a fixed string. – Nick T Sep 13 '17 at 20:03
  • If it's a system thing, you may need to include more information, like your OS and its version, Python version/build, and versions of requests, grequests, urllib3, and PyOpenSSL (if installed). Sounds more like a bug report then... – Nick T Sep 13 '17 at 20:30
  • You could try to limit gevent concurrency with `grequests.map(rs, size=2)`. – georgexsh Dec 07 '17 at 17:37
  • I see this comment on the [github site](https://github.com/kennethreitz/grequests): "**Note**: You should probably use requests-threads or requests-futures instead." Also, the last code update appears to be 2 years ago. –  Dec 07 '17 at 18:09
  • It might be related to the `https` part of the query, meaning the 3DES/TLS secure(d) connection can't be established. This blog post mentions that this connection type is insecure for bulk transfer and prolonged connections, which your map call might be. https://lukasa.co.uk/2017/02/Configuring_TLS_With_Requests/ – fabianegli Dec 07 '17 at 20:11
  • Based on the use case you mentioned, I would use Scrapy (www.scrapy.org). With it you can write a web crawler in a simple manner. You can check out my amazoncrawler here as an example: https://github.com/Kitzi/crawler Scrapy is also Python-based, so you will get results quickly. – Jakob Dec 14 '17 at 14:57

2 Answers

You are simply sending requests too fast. Because grequests is an async library, all of these requests are sent almost simultaneously, which is too many at once.

Limit the number of concurrent requests with grequests.map(rs, size=your_choice). I tested grequests.map(rs, size=10) and it works well.
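
To make the idea of a concurrency cap concrete, here is a rough standard-library sketch of what limiting the number of in-flight requests means. It is not how grequests is implemented (grequests is built on gevent, while this uses a thread pool as a stand-in), and `bounded_map` and its `fetch` parameter are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def bounded_map(fetch, urls, size):
    """Call fetch(url) for every URL with at most `size` calls in flight.

    Results keep the input order. A call that raises yields None,
    loosely mirroring the None entries seen in the question.
    """
    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            return None

    # max_workers caps how many fetches can run at the same time
    with ThreadPoolExecutor(max_workers=size) as pool:
        return list(pool.map(safe_fetch, urls))
```

With `size` equal to the number of URLs, every request starts at once, which is roughly the situation in the question; a smaller `size` spreads the same work over a longer period.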

Sraw
  • What does "too fast" mean? Where is the bottleneck, and what is the limitation? Is it measurable, or can it be optimized for? Why do you figure size=10 is optimal for your machine, and how do you find the value on other machines? – advance512 Jun 17 '20 at 22:24
  • "Fast" is from the server's point of view: the server doesn't want to accept so many requests from one client, as that could crash it. You reduce the speed to show respect to the server, so the server is happy to serve you. – Sraw Jun 18 '20 at 00:24

I do not know the exact reason for the observed behavior with .map(). However, using the .imap() function with size=1 always returned a 'Response 200' during my few minutes of testing. Here is the code snippet:

rs = (grequests.get(u) for u in [BASE.format(t) for t in tickers])
rsm_iterator = grequests.imap(rs, exception_handler=exception_handler, size=1)
rsm_list = list(rsm_iterator)
print(rsm_list)

If you don't want to wait for all requests to finish before working on their responses, you can iterate over the results directly:

rs = (grequests.get(u) for u in [BASE.format(t) for t in tickers])
rsm_iterator = grequests.imap(rs, exception_handler=exception_handler, size=1)
for r in rsm_iterator:
    print(r)
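
As far as I know, imap yields responses as they complete rather than in input order once size is greater than 1, so it can help to keep each result paired with its URL. The following is a standard-library analogue of that pattern, not grequests code; `iter_as_completed` and `fetch` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def iter_as_completed(fetch, urls, size):
    """Yield (url, result) pairs as each fetch finishes.

    Pairs arrive in completion order, not input order. If a fetch
    raises, the exception object is yielded in place of the result.
    """
    with ThreadPoolExecutor(max_workers=size) as pool:
        # Remember which URL each future belongs to
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                result = future.result()
            except Exception as exc:
                result = exc
            yield url, result
```

With size=1 the fetches run one at a time, which is essentially the same as the plain requests loop in the question.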
fabianegli