
I want to fetch data from multiple pages (about 10,000 of them) that return arrays of numbers. Fetching them one at a time takes too long, and I'm new to Python, so I don't know much about multithreading and asynchrony in this language.

The code works fine and retrieves all the expected data, but it takes several minutes to finish. I know it could probably be done faster if I made more than one request at a time.

import http.client
import json

def get_all_data():
    connection = http.client.HTTPConnection("localhost:5000")
    page = 1
    data = {}

    try:
        while True:
            api_url = f'/api/numbers?page={page}'
            connection.request('GET', api_url)
            response = connection.getresponse()

            # The status is an int, so compare with ==, not "is"
            # (identity comparison on ints is not guaranteed to work).
            if response.status == 200:
                data[f'{page}'] = json.loads(response.read())['numbers']
                items_returned = len(data[f'{page}'])
                print(f'Please wait, fetching data... Request: {page} -- Items returned: {items_returned}')
                if items_returned == 0:  # an empty page means we are done
                    break
                page += 1
            else:
                break  # stop on an unexpected status instead of looping forever
    finally:
        connection.close()

    print('All requests completed!')
    return data

How can I refactor this code to make multiple requests at a time instead of one by one?

Douglas Ferreira
  • Take a look at [How to use threading in Python?](https://stackoverflow.com/questions/2846653/how-to-use-threading-in-python) – hostingutilities.com Jan 11 '19 at 06:33
  • I understand what was going on in https://stackoverflow.com/a/2846697/5921486, but in that case he was creating multiple threads for multiple URLs. In my case I'm always using the same URL and just changing the params. How do I deal with that? Because to increment the page param and move on to the next page, I have to get a positive HTTP response first... – Douglas Ferreira Jan 11 '19 at 06:54
  • If you absolutely must wait for a positive response before you make another request, then... you can't make another request. What you want is impossible under those circumstances. But I don't really see why it would be bad to request page 2 before you get a positive response for page 1. I found [this library](https://github.com/ross/requests-futures) that is meant for doing simultaneous HTTP requests (see the sketch after these comments). – hostingutilities.com Jan 12 '19 at 17:44
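
A minimal sketch of that requests-futures idea, assuming the localhost endpoint from the question and an arbitrary look-ahead window of 10 pages (the window size is not from the original post). It speculatively requests the next 10 pages at once and stops at the first empty page, matching the stopping rule in the original code:

from requests_futures.sessions import FuturesSession  # pip install requests-futures

WINDOW = 10  # how many pages to request speculatively; tune as needed

def get_all_data():
    session = FuturesSession(max_workers=WINDOW)
    data = {}
    page = 1
    while True:
        # Fire off the next WINDOW requests without waiting for responses.
        futures = {p: session.get(f'http://localhost:5000/api/numbers?page={p}')
                   for p in range(page, page + WINDOW)}
        done = False
        for p, future in futures.items():
            response = future.result()  # block until this response arrives
            numbers = response.json()['numbers'] if response.status_code == 200 else []
            if not numbers:
                done = True  # an empty page marks the end of the data
                break
            data[f'{p}'] = numbers
        if done:
            break
        page += WINDOW
    return data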

2 Answers


Basically there are three ways of doing this kind of job: multithreading, multiprocessing, and async. As ACE mentioned, the page parameter exists because the server generates the pages dynamically, and the number of pages may change over time as the database is updated. The easiest approach here is a batch job: request a fixed number of pages at a time, wrap each batch in a try/except block, and handle the last (partial) batch separately. Make the number of requests per batch a variable and try different values.
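
A rough sketch of that batch idea using only the standard library (http.client connections are not thread-safe, so each worker makes its own request via urllib.request); the batch size of 20 and the endpoint are assumptions to experiment with:

import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 20  # number of requests per batch; try different values

def fetch_page(page):
    # Each call opens its own connection, so it is safe to run in parallel.
    url = f'http://localhost:5000/api/numbers?page={page}'
    try:
        with urllib.request.urlopen(url) as response:
            return page, json.loads(response.read())['numbers']
    except Exception:
        return page, []  # treat a failed request like an empty page

def get_all_data():
    data = {}
    start = 1
    with ThreadPoolExecutor(max_workers=BATCH_SIZE) as executor:
        while True:
            pages = range(start, start + BATCH_SIZE)
            done = False
            # executor.map yields results in page order, which makes the
            # "first empty page ends the crawl" rule easy to apply.
            for page, numbers in executor.map(fetch_page, pages):
                if not numbers:
                    done = True
                    break
                data[f'{page}'] = numbers
            if done:
                break
            start += BATCH_SIZE
    return data

If a batch straddles the last page, the extra pages simply come back empty and are discarded, which handles the last partial batch without any special-case code.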

minglyu

Your page parameter (producer) is dynamic, and it depends on the response to the previous request (consumer). Unless you can separate the producer from the consumer, you can't use coroutines or multithreading.

ACE Fly