
I am trying to retrieve 10 images through an API that returns JSON data. I first make a request to the API and store the 10 image URLs from the returned JSON in a list. In my original version I then made individual requests to those URLs and saved the response content to file. My code is below, with my API key removed for obvious reasons:

def get_image(search_term):

    number_images = 10
    images = requests.get("https://pixabay.com/api/?key=insertkey&q={}&per_page={}".format(search_term,number_images))
    images_json_dict = images.json()

    hits = images_json_dict["hits"]
    urls = []
    for i in range(len(hits)):
        urls.append(hits[i]["webformatURL"])

    count =0
    for url in urls:
        picture_request = requests.get(url)
        if picture_request.status_code == 200:
            try:
                with open(dir_path+r'\\images\\{}.jpg'.format(count),'wb') as f:
                    f.write(picture_request.content)
            except:
                os.mkdir(dir_path+r'\\images\\')
                with open(dir_path+r'\\images\\{}.jpg'.format(count),'wb') as f:
                    f.write(picture_request.content)
        count+=1

This was working fine apart from the fact that it was very slow; it took maybe 7 seconds to pull in those 10 images and save them to a folder. I read that it's possible to use Session() in the requests library to improve performance, and I'd like to have those images as quickly as possible. I've modified the code as shown below, but the problem I'm having is that the get request on the session object returns a requests.sessions.Session object rather than a response with a status code, and there is also no .content attribute to retrieve the content (I've added comments to the relevant lines of code below). I'm relatively new to programming, so I'm not sure this is even the best way to do this. My question is: how can I retrieve the image content now that I am using Session(), or is there some smarter way to do this?

def get_image(search_term):

    number_images = 10
    images = requests.get("https://pixabay.com/api/?key=insertkey&q={}&per_page={}".format(search_term,number_images))
    images_json_dict = images.json()

    hits = images_json_dict["hits"]
    urls = []
    for i in range(len(hits)):
        urls.append(hits[i]["webformatURL"])

    count =0
    #Now using Session()
    picture_request = requests.Session()
    for url in urls:
        picture_request.get(url)
        #This will no longer work as picture_request is an object
        if picture_request == 200:
            try:
                with open(dir_path+r'\\images\\{}.jpg'.format(count),'wb') as f:
                    #This will no longer work as there is no .content method
                    f.write(picture_request.content)
            except:
                os.mkdir(dir_path+r'\\images\\')
                with open(dir_path+r'\\images\\{}.jpg'.format(count),'wb') as f:
                    #This will no longer work as there is no .content method
                    f.write(picture_request.content)
        count+=1
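
For context, here is roughly what I was hoping would work, i.e. that session.get() would hand back a response with .status_code and .content the way requests.get() does. This is just a sketch of my intent with placeholder URLs, so it may well be wrong:

import requests

session = requests.Session()
urls = ["https://example.com/0.jpg", "https://example.com/1.jpg"]  # placeholder URLs

for count, url in enumerate(urls):
    # Hoping the session call behaves like requests.get() did in my first version
    picture_request = session.get(url)
    if picture_request.status_code == 200:
        with open("{}.jpg".format(count), "wb") as f:
            f.write(picture_request.content)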
  • You may use sessions, but that alone won't really improve performance. What you need is to download the images in parallel; for that you can either use the `threading` library to run multiple requests at once, or use `aiohttp` instead of `requests` for async requests. The second option will perform better if you really need the speed. I can write an example of whichever you'd prefer. – XCanG Dec 06 '19 at 09:25
  • @XCanG That would be helpful, thank you. I've been reading through some tutorials on threading, but I'm not familiar with how it works or how I could apply it in this particular situation. – Blargian Dec 07 '19 at 08:52
  • so you want to use `threading` with `requests`? – XCanG Dec 07 '19 at 15:46

1 Answer


Assuming you want to stick with the requests library, you will need to use threading to run multiple downloads in parallel.

The concurrent.futures library has a convenient way to create a pool of threads: concurrent.futures.ThreadPoolExecutor.

fetch() downloads a single image. fetch_all() creates the thread pool; you can choose how many threads to run by passing the threads argument. get_urls() is your function for retrieving the list of URLs; you should pass your token (key) and search_term to it.

Note: if you are on a Python version older than 3.6, you should replace the f-strings (f"{args}") with regular string formatting ("{}".format(args)).

import os
import requests
from concurrent import futures


def fetch(url, session=None):
    # Download a single image, reusing the shared session when one is given.
    if session:
        r = session.get(url, timeout=60.0)
    else:
        r = requests.get(url, timeout=60.0)
    r.raise_for_status()

    return r.content


def fetch_all(urls, session=None, threads=8):
    # Fan the downloads out over a thread pool and yield results as they finish.
    if session is None:
        session = requests.Session()
    with futures.ThreadPoolExecutor(max_workers=threads) as executor:
        future_to_url = {executor.submit(fetch, url, session=session): url for url in urls}
        for future in futures.as_completed(future_to_url):
            url = future_to_url[future]
            if future.exception() is None:
                yield url, future.result()
            else:
                print(f"{url} generated an exception: {future.exception()}")
                yield url, None


def get_urls(search_term, number_images=10, token="", session=None):
    # Query the Pixabay API and return the webformatURL of every hit.
    if session is None:
        session = requests.Session()
    r = session.get(f"https://pixabay.com/api/?key={token}&q={search_term}&per_page={number_images}")
    r.raise_for_status()
    urls = [hit["webformatURL"] for hit in r.json().get("hits", [])]

    return urls


if __name__ == "__main__":
    root_dir = os.getcwd()
    session = requests.Session()
    urls = get_urls("term", token="token", session=session)

    # Save every successfully downloaded image under ./images/
    for url, content in fetch_all(urls, session=session):
        if content is not None:
            f_dir = os.path.join(root_dir, "images")
            if not os.path.isdir(f_dir):
                os.makedirs(f_dir)
            with open(os.path.join(f_dir, os.path.basename(url)), "wb") as f:
                f.write(content)

I also recommend that you take a look at aiohttp. I won't write out a full solution here, but will instead point you to an article on a similar task where you can read more about it.
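If you do want to try the aiohttp route, the core of it would look roughly like the sketch below. Treat it only as an outline: it assumes aiohttp 3.x and Python 3.7+ (for asyncio.run), and simply mirrors the fetch/fetch_all split from above rather than being a drop-in replacement.

import asyncio
import aiohttp


async def fetch(session, url):
    # Download one image and return (url, bytes), or (url, None) on error.
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=60)) as resp:
            resp.raise_for_status()
            return url, await resp.read()
    except aiohttp.ClientError as e:
        print(f"{url} generated an exception: {e}")
        return url, None


async def fetch_all(urls):
    # One shared ClientSession; all downloads are scheduled concurrently.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))


# Usage: results = asyncio.run(fetch_all(urls))

The structure mirrors the threaded version (one shared session, many downloads in flight at once), except the concurrency comes from the event loop instead of a thread pool.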
