
I'm trying to crawl a website to get the info about vehicles. I want to get all vehicles from that site. I want to repeat the process every day because every day there are new vehicles.

There are a lot of cars, more than 100 thousand, so crawling them all in one process would take too much time.

So I need to split the work into many smaller processes instead of one big one.

If I understand correctly, that can be done with IBM Cloud Functions.

For example, for every make and for every model of that make, I could call an action to get the list of cars.

That way I would have many smaller processes instead of one big one, and the whole crawl would take less time.

The idea is as follows:

  • Call an action which gets all makes and loops through them. For every make, first create an action and then call it.

The code is as follows:

import sys
import os
import json
import requests
import http.client
import uuid

API_URL = "https://url.com"
APIHOST = os.environ.get('__OW_API_HOST')
NAMESPACE = os.environ.get('__OW_NAMESPACE')
USER_PASS = os.environ.get('__OW_API_KEY').split(':')

code = "New function code"

makes = [
    {"id": 9,"name": "Audi"},
    {"id": 74,"name": "Volkswagen"}
]

def main(dict):
    conn = http.client.HTTPSConnection("openwhisk.eu-gb.bluemix.net")
    payload = json.dumps({"exec": {"kind": "python-jessie:3", "code": code}})
    headers = {
        'accept': "application/json",
        'content-type': "application/json",
        'Authorization': "Basic my-base64key"
    }

    for make in makes:
        action = 'models-{0}'.format(make['name'])
        url = APIHOST + '/api/v1/namespaces/' + NAMESPACE + '/actions/' + action + "?overwrite=true"

        conn.request("PUT", url, payload, headers)  # Create new action
        # Execute the new action

    return {"Success": "Main executed correctly."}

The problem is in the for loop. With only one make it works fine, but with two or more it fails with the following error:

[
    "2018-07-11T08:53:06.322665342Z stderr: Traceback (most recent call last):",
    "2018-07-11T08:53:06.322685254Z stderr: File \"pythonrunner.py\", line 88, in run",
    "2018-07-11T08:53:06.322692936Z stderr: exec('fun = %s(param)' % self.mainFn, self.global_context)",
    "2018-07-11T08:53:06.322699124Z stderr: File \"<string>\", line 1, in <module>",
    "2018-07-11T08:53:06.322705761Z stderr: File \"__main__.py\", line 71, in main",
    "2018-07-11T08:53:06.322712082Z stderr: File \"/usr/local/lib/python3.6/http/client.py\", line 1239, in request",
    "2018-07-11T08:53:06.322718524Z stderr: self._send_request(method, url, body, headers, encode_chunked)",
    "2018-07-11T08:53:06.322724518Z stderr: File \"/usr/local/lib/python3.6/http/client.py\", line 1250, in _send_request",
    "2018-07-11T08:53:06.322730924Z stderr: self.putrequest(method, url, **skips)",
    "2018-07-11T08:53:06.322736931Z stderr: File \"/usr/local/lib/python3.6/http/client.py\", line 1108, in putrequest",
    "2018-07-11T08:53:06.322742876Z stderr: raise CannotSendRequest(self.__state)",
    "2018-07-11T08:53:06.322748626Z stderr: http.client.CannotSendRequest: Request-sent"
]

Any idea how I can make those requests inside the for loop when there are two or more records?

Employee
Boky
  • You should read the response before sending a new request. I think this link may be helpful: [link1](https://stackoverflow.com/questions/1925639/httplib-cannotsendrequest-error-in-wsgi) – Domenico Vito Scalera Jul 11 '18 at 10:19
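
A minimal sketch of what the comment above describes: with `http.client`, each `request()` on a connection is followed by `getresponse()` and `read()` before the next `request()` is sent. The local `EchoHandler` server and the `put_all` and `demo` helpers are hypothetical stand-ins so the example runs without the Cloud Functions API; they are not part of the original code.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical local stand-in for the remote API: echoes the request path."""

    protocol_version = "HTTP/1.1"  # keep-alive, so one connection serves all requests

    def do_PUT(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # consume the request body
        body = self.path.encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def put_all(host, port, paths, payload):
    """Send one PUT per path over a single connection."""
    conn = http.client.HTTPConnection(host, port)
    results = []
    try:
        for path in paths:
            conn.request("PUT", path, body=payload,
                         headers={"Content-Type": "application/json"})
            # Fetch and fully read the response before the next request();
            # skipping this step is what raises CannotSendRequest.
            response = conn.getresponse()
            results.append((response.status, response.read().decode()))
    finally:
        conn.close()
    return results


def demo():
    """Start the stand-in server on a free port and send two PUTs to it."""
    server = HTTPServer(("127.0.0.1", 0), EchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        paths = ["/actions/models-Audi", "/actions/models-Volkswagen"]
        return put_all("127.0.0.1", server.server_port, paths, "{}")
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo()` sends both PUT requests over the same connection without the `CannotSendRequest` error, because every response is consumed before the connection is reused.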

2 Answers


Splitting the crawler process into separate action invocations is a sensible approach to running this work on IBM Cloud Functions.

However, it would be better to have a single action that uses event parameters to determine the car make and model to crawl, rather than having a separate action for each make and model.

The code above, which iterates over the make & model list, can then invoke that single action multiple times with different event parameters, rather than trying to create a new action per item.
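
As a rough sketch of that approach, the loop could fire one non-blocking invocation of a single action per make, passing the make as parameters in the request body. The action name `crawl-models`, the parameter names `make_id` and `make_name`, and both helper functions are assumptions made for illustration. The sketch also assumes `__OW_API_HOST` includes the `https://` scheme and that `__OW_API_KEY` has the `user:password` form used in the question.

```python
import base64
import json
import urllib.request


def build_invocation(apihost, namespace, action, api_key, params):
    """Build a POST request that invokes one action without waiting for its result."""
    url = "{0}/api/v1/namespaces/{1}/actions/{2}?blocking=false".format(
        apihost.rstrip("/"), namespace, action)
    token = base64.b64encode(api_key.encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps(params).encode(),  # event parameters for the action
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )


def invoke_for_makes(apihost, namespace, api_key, makes, action="crawl-models"):
    """Invoke the same (hypothetical) action once per make; return activation ids."""
    activation_ids = []
    for make in makes:
        request = build_invocation(
            apihost, namespace, action, api_key,
            {"make_id": make["id"], "make_name": make["name"]})
        # urlopen reads each response inside the with-block before the next call.
        with urllib.request.urlopen(request) as response:
            activation_ids.append(json.load(response).get("activationId"))
    return activation_ids
```

Inside the invoked action, the make would then arrive in the dictionary passed to `main`, so one deployed action can serve every make and model.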

James Thomas

Were you able to resolve the issue? To me it looks like the way you use the http.client library is not compatible with running inside Cloud Functions, due to its asynchronous invocation behavior.

According to the Python documentation, you shouldn't use this library directly. Use the highly recommended requests module instead.

http.client — HTTP protocol client Source code: Lib/http/client.py

This module defines classes which implement the client side of the HTTP and HTTPS protocols. It is normally not used directly — the module urllib.request uses it to handle URLs that use HTTP and HTTPS.

See also: The Requests package is recommended for a higher-level HTTP client interface.

https://docs.python.org/3/library/http.client.html
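
For illustration, the loop from the question might look roughly like this with requests. This is a sketch under assumptions rather than a tested deployment: `action_url` and `create_actions` are hypothetical helper names, `__OW_API_HOST` is assumed to include the `https://` scheme, and the API key is assumed to have the `user:password` form. requests reads each response body before the call returns, so the session can be reused on the next iteration.

```python
def action_url(apihost, namespace, action):
    """Build the REST URL of an action (hypothetical helper)."""
    return "{0}/api/v1/namespaces/{1}/actions/{2}".format(
        apihost.rstrip("/"), namespace, action)


def create_actions(apihost, namespace, api_key, makes, code):
    """Create or overwrite one action per make; return HTTP status codes by name."""
    # Third-party package, imported here so action_url stays usable without it.
    import requests

    user, password = api_key.split(":", 1)
    payload = {"exec": {"kind": "python-jessie:3", "code": code}}
    statuses = {}
    with requests.Session() as session:
        session.auth = (user, password)  # HTTP Basic auth on every request
        for make in makes:
            name = "models-{0}".format(make["name"])
            response = session.put(
                action_url(apihost, namespace, name),
                params={"overwrite": "true"},
                json=payload,
                timeout=30,
            )
            statuses[name] = response.status_code
    return statuses
```

Checking the returned status codes (or calling `response.raise_for_status()` inside the loop) would show which actions failed to be created.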

Imran