I'm trying to crawl a website to collect information about vehicles. I want to get all vehicles from the site and repeat the process every day, because new vehicles are added daily.
There are a lot of cars, more than 100 thousand, so crawling them all in a single process would take too long.
I therefore need to split the work into many smaller processes instead of one big one.
If I understand correctly, that can be done with IBM Cloud Functions.
For example, for every make, and for every model of that make, I could call an action to get the list of cars.
That way I would have many smaller processes instead of one big one, and the whole crawl would take less time.
The idea is as follows:
- Call an action which gets all makes and loops through them. For every make, first create an action and then call it.
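To make the idea concrete, here is a rough sketch of the request plan only (it is not my actual code and sends nothing). It assumes the `models-<make name>` naming scheme used further down, and it assumes that an action is invoked with a `POST` to the same path that `PUT` creates it on, with `?blocking=false` for a fire-and-forget call; the helper name `plan_requests` is made up for illustration.

```python
from urllib.parse import quote


def plan_requests(namespace, makes):
    """Return (method, path) pairs: create each per-make action, then invoke it."""
    plan = []
    for make in makes:
        # Assumed naming scheme: one action per make, called models-<make name>.
        action = quote("models-{0}".format(make["name"]), safe="")
        base = "/api/v1/namespaces/{0}/actions/{1}".format(
            quote(namespace, safe=""), action
        )
        # Create (or overwrite) the action.
        plan.append(("PUT", base + "?overwrite=true"))
        # Assumption: non-blocking invocation of the action just created.
        plan.append(("POST", base + "?blocking=false"))
    return plan
```

For two makes this yields four entries, a create followed by an invoke for each make.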
My current code is as follows:
import sys
import os
import json
import requests
import http.client
import uuid

API_URL = "https://url.com"
APIHOST = os.environ.get('__OW_API_HOST')
NAMESPACE = os.environ.get('__OW_NAMESPACE')
USER_PASS = os.environ.get('__OW_API_KEY').split(':')

code = "New function code"

makes = [
    {"id": 9, "name": "Audi"},
    {"id": 74, "name": "Volkswagen"}
]

def main(dict):
    conn = http.client.HTTPSConnection("openwhisk.eu-gb.bluemix.net")
    payload = json.dumps({"exec": {"kind": "python-jessie:3", "code": code}})
    headers = {
        'accept': "application/json",
        'content-type': "application/json",
        'Authorization': "Basic my-base64key"
    }
    for make in makes:
        action = 'models-{0}'.format(make['name'])
        url = APIHOST + '/api/v1/namespaces/' + NAMESPACE + '/actions/' + action + "?overwrite=true"
        conn.request("PUT", url, payload, headers)  # Create new action
        # Execute the new action
    return {"Success": "Main executed correctly."}
The problem is in the for loop. If there is only one make, it works fine, but with two or more it fails with the following error:
[
"2018-07-11T08:53:06.322665342Z stderr: Traceback (most recent call last):",
"2018-07-11T08:53:06.322685254Z stderr: File \"pythonrunner.py\", line 88, in run",
"2018-07-11T08:53:06.322692936Z stderr: exec('fun = %s(param)' % self.mainFn, self.global_context)",
"2018-07-11T08:53:06.322699124Z stderr: File \"<string>\", line 1, in <module>",
"2018-07-11T08:53:06.322705761Z stderr: File \"__main__.py\", line 71, in main",
"2018-07-11T08:53:06.322712082Z stderr: File \"/usr/local/lib/python3.6/http/client.py\", line 1239, in request",
"2018-07-11T08:53:06.322718524Z stderr: self._send_request(method, url, body, headers, encode_chunked)",
"2018-07-11T08:53:06.322724518Z stderr: File \"/usr/local/lib/python3.6/http/client.py\", line 1250, in _send_request",
"2018-07-11T08:53:06.322730924Z stderr: self.putrequest(method, url, **skips)",
"2018-07-11T08:53:06.322736931Z stderr: File \"/usr/local/lib/python3.6/http/client.py\", line 1108, in putrequest",
"2018-07-11T08:53:06.322742876Z stderr: raise CannotSendRequest(self.__state)",
"2018-07-11T08:53:06.322748626Z stderr: http.client.CannotSendRequest: Request-sent"
]
Any idea how I can make those requests inside the for loop when there are two or more records?
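One thing that may be relevant: an `http.client` connection allows only one outstanding request at a time, and the `Request-sent` state in the traceback suggests that `request()` was called again before the previous response was fetched. Below is a minimal sketch of the loop under that assumption, where each response is read before the next request is sent. The helper name `create_actions` and its parameters are made up for illustration, and `conn` is expected to behave like an `http.client.HTTPSConnection`.

```python
import json


def create_actions(conn, namespace, makes, code, headers):
    """PUT one action per make over a single connection, reading each response."""
    payload = json.dumps({"exec": {"kind": "python-jessie:3", "code": code}})
    statuses = {}
    for make in makes:
        action = "models-{0}".format(make["name"])
        path = "/api/v1/namespaces/" + namespace + "/actions/" + action + "?overwrite=true"
        conn.request("PUT", path, body=payload, headers=headers)
        # Fetch and fully read the response before reusing the connection;
        # otherwise the next request() raises CannotSendRequest.
        response = conn.getresponse()
        response.read()
        statuses[action] = response.status
    return statuses
```

Inside `main` this could be called with the existing `conn`, `NAMESPACE`, `makes`, `code` and `headers`. Opening a new connection for every make would be another way around the same limitation, at the cost of a fresh TLS handshake each time.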