I’ve been having trouble speeding up Pywikibot. I’ve seen related questions here on StackOverflow, but they only partially apply to my problem:
- I set throttle=False wherever I could, but the bot is still very slow.
- I can’t use the PreloadingPageGenerator as proposed here, because I am not using the bot to access Wikipedia but Wikidata.

In my case, the requests look something like this:

from pywikibot.data import api

request_parameters = {
    'action': 'wbsearchentities',
    'format': 'json',
    'language': language,
    'type': 'item',
    'search': name,
    'throttle': False
}
request = api.Request(site=self.wikidata_site, use_get=True, **request_parameters)
response = request.submit()
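For completeness, self.wikidata_site in the snippet above is just the Wikidata repository. My real code creates it once and stores it on a class, but the construction is more or less the standard call (a sketch, nothing special):

import pywikibot

# The Wikidata repository that self.wikidata_site refers to above;
# in my actual code this is created once and kept on the class.
wikidata_site = pywikibot.Site('wikidata', 'wikidata')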
I now tried to use multiprocessing so that multiple requests can be sent to the API at once, which removes the need to wait for one response before sending the next request. My attempt looks like this:
while not queue.empty():  # Queue holding data for requests
    job_data = [queue.get() for i in range(number_of_processes)]
    jobs = [
        multiprocessing.Process(
            target=search_for_entity,
            args=(name, language)
        )
        for name, language in job_data
    ]
    for job in jobs:
        job.start()
    for job in jobs:
        job.join()
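For comparison, the same idea with a thread pool instead of separate processes would look roughly like this (a sketch I have not tested; concurrent.futures is from the standard library, and queue, number_of_processes and search_for_entity are the same names as above):

import concurrent.futures

# Same idea, but with worker threads instead of processes:
# drain the queue and let the pool issue the API requests concurrently.
with concurrent.futures.ThreadPoolExecutor(max_workers=number_of_processes) as pool:
    futures = []
    while not queue.empty():
        name, language = queue.get()
        futures.append(pool.submit(search_for_entity, name, language))
    results = [future.result() for future in futures]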
But the moment I run the program, it doesn’t even finish the first request; it just gets stuck. I followed the bug down to submit() in pywikibot/data/api.py:1500:
rawdata = http.request(
    site=self.site, uri=uri, method='GET' if use_get else 'POST',
    body=body, headers=headers)
through fetch() in pywikibot/comms/http.py:361:
request = _enqueue(uri, method, body, headers, **kwargs)
request._join()
to _join() in pywikibot/comms/threadedhttp.py:359, where an acquired lock never seems to get released:
def _join(self):
    """Block until response has arrived."""
    self.lock.acquire(True)
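Putting the pieces together, the smallest standalone version of what I am doing looks roughly like this (the search term 'Douglas Adams' is just a placeholder; in my full program, the equivalent join() at the end is where everything hangs):

import multiprocessing

import pywikibot
from pywikibot.data import api

def search_for_entity(name, language):
    site = pywikibot.Site('wikidata', 'wikidata')
    request = api.Request(site=site, use_get=True,
                          action='wbsearchentities', format='json',
                          language=language, type='item',
                          search=name, throttle=False)
    return request.submit()

if __name__ == '__main__':
    # Calling the function directly works (just slowly).
    search_for_entity('Douglas Adams', 'en')
    # Running it in a separate process is where my full program gets stuck.
    job = multiprocessing.Process(target=search_for_entity,
                                  args=('Douglas Adams', 'en'))
    job.start()
    job.join()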
My question now is: Is this a bug in pywikibot? Have I applied multiprocessing to this problem in the wrong way? Are there any other solutions in my specific situation to speed up pywikibot?