
I am trying to build a web crawler on Google App Engine (GAE) using Flask and Python. I am no expert in building web apps.

So, I created a simple main page with two buttons, 'Single' and 'List', which take you to pages where you can enter a single URL or upload a CSV file of URLs, respectively.

Now, the single-URL part is pretty straightforward, but the list part is tricky. Say I upload a CSV file of n URLs: I want each of them to invoke the 'Single' handler (so n calls in total), and all the calls need to run in parallel, as with multiprocessing/threading.

How do I go about this? Googling led me to task queues, and I am reading about those now, but I want to know which is the best approach, and any examples would be greatly appreciated.

Thanks in advance.

  • Have you read about Scrapy? You can integrate it with your application, and it will handle all the requests for you. – Shlomi Bazel Jan 28 '19 at 12:25
  • Yes, but I can't change my crawler and I need to use my own crawling function. I just need to know how to make parallel calls to one service from another service of the same app on Google App Engine. – Amruth Kiran Jan 28 '19 at 12:30

2 Answers


You can use a Pool. It's in the standard library's multiprocessing module.

Its map method takes a function and an iterable and applies the function to each element in parallel.

from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    # Pool(5) starts 5 worker processes; map() distributes the inputs
    # across them and collects the results in their original order.
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3]))
  • But can we do multiprocessing (using Pool workers) on a Flask app hosted on GAE? I read somewhere that it's not possible. – Amruth Kiran Jan 28 '19 at 12:51

I believe what you need to do is implement asynchronous HTTP requests: that way you can send all n URL requests at the same time, and your script's execution won't be blocked while you wait for the requests to finish.

The reason you found that Cloud Tasks would be a way to implement this is that they run asynchronously by design.
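For instance, here is a minimal sketch of enqueueing one Cloud Task per URL with the google-cloud-tasks client library. The project ID, queue name, and the /single endpoint are assumptions about your setup, not something from your question, and the exact enum/call style varies between client library versions:

from google.cloud import tasks_v2

# Placeholders: substitute your own project ID, region, and queue name.
client = tasks_v2.CloudTasksClient()
parent = client.queue_path("your-project-id", "us-central1", "crawl-queue")

def enqueue_crawl(url):
    # Each task makes App Engine POST the URL to your 'Single' handler;
    # tasks are dispatched independently, so n tasks run in parallel
    # (up to the queue's rate limits).
    task = {
        "app_engine_http_request": {
            "http_method": tasks_v2.HttpMethod.POST,
            "relative_uri": "/single",
            "body": url.encode(),
        }
    }
    return client.create_task(parent=parent, task=task)

for url in urls_from_csv:  # the list parsed from the uploaded CSV
    enqueue_crawl(url)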

However, if you need to consume the responses of the HTTP requests within the same run of the script, it would be better to use urlfetch. From this documentation, which also includes some code examples:

HTTP(S) requests are synchronous by default. To issue an asynchronous request, your application must:

  1. Create a new RPC object using urlfetch.create_rpc(). This object represents your asynchronous call in subsequent method calls.
  2. Call urlfetch.make_fetch_call() to make the request. This method takes your RPC object and the request target's URL as parameters.
  3. Call the RPC object's get_result() method. This method returns the result object if the request is successful, and raises an exception if an error occurred during the request.
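Putting those three steps together, a minimal sketch for fetching all n URLs concurrently on the App Engine Standard (Python 2) runtime might look like this, assuming urls is the list parsed from your CSV:

from google.appengine.api import urlfetch

rpcs = []
for url in urls:
    # Step 1: create an RPC object representing the async call.
    rpc = urlfetch.create_rpc(deadline=10)
    # Step 2: start the fetch without blocking.
    urlfetch.make_fetch_call(rpc, url)
    rpcs.append(rpc)

# Step 3: collect the results; get_result() blocks only until
# that particular request has finished.
for rpc in rpcs:
    try:
        result = rpc.get_result()
        print(result.status_code)
    except urlfetch.Error:
        pass  # that request failed or timed out

Because all the fetches are started before any result is collected, the requests overlap in time instead of running one after another.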

You can implement this with either of the two options: the first lets you create and manage the requests in a separate service, called from within your application, while the other runs natively in App Engine (Standard) and may be more straightforward to implement.
