I have a list of dictionaries that I fill in as I search a JSON URL. The problem is that the JSON (provided by the Google Books API) is not always complete. This is a search for books and, from what I saw, all of them have id, title and authors, but not all of them have imageLinks. Here's a JSON link as an example: Search for Harry Potter.

Note that it always returns 10 results. In this example there are 10 IDs, 10 titles and 10 authors, but only 4 imageLinks.

@app.route('/search', methods=["GET", "POST"])
@login_required
def search():
    if request.method == "POST":
        while True:
            try:
                seek = request.form.get("seek")
                url = f'https://www.googleapis.com/books/v1/volumes?q={seek}'
                response = requests.get(url)
                response.raise_for_status()
                search = response.json()
                seek = search['items']
                infobooks = []
                for i in range(len(seek)):
                    infobooks.append({
                        "book_id": seek[i]['id'],
                        "thumbnail": seek[i]['volumeInfo']['imageLinks']['thumbnail'],
                        "title": seek[i]['volumeInfo']['title'],
                        "authors": seek[i]['volumeInfo']['authors']
                    })
                return render_template("index.html", infobooks=infobooks)
            except (requests.RequestException, KeyError, TypeError, ValueError):
                continue
    else:
        return render_template("index.html")

With the method shown above I can eventually get 10 imageLinks (thumbnails), but it takes a long time! Does anyone have a suggestion for making this request faster? Or a way to insert a "Book Without Cover" image when no imageLink is found? (Not what I would like, but it's better than having to wait for the results.)

ARNON
  • At least for me, your question is not clear. Do you retrieve only 4 `imageLinks` or 10 `imageLinks`? How is that related to your performance issue? – Rodrigo Rodrigues Jun 07 '21 at 23:11
  • In the example I provided: [Harry Potter Search](https://www.googleapis.com/books/v1/volumes?q=harry%20potter), there are only 4 imageLinks, but each search can return a number of different imageLinks (thumbnails). – ARNON Jun 07 '21 at 23:29
  • Why do you need to always retrieve a thumbnail? Is a response without a thumbnail a useful response? Maybe you can take as many thumbnails as you get on the first try, then have a separate API for fetching a specific thumbnail, which you can hit asynchronously and deliver later. – Rodrigo Rodrigues Jun 08 '21 at 04:32
  • I will probably have to accept an answer without thumbnails. Of course, I'd like all the results to have thumbnails, but from all the documentation I'm reading, I see it's very unlikely that I'll get this in a short time. – ARNON Jun 08 '21 at 12:11

4 Answers

Firstly, your function will never end up with 10 imageLinks, because the API always returns the same results for the same query. If you retrieved 4 imageLinks the first time, you will get the same 4 the second time, unless Google updates the dataset, which is out of your control.

The Google Books API returns 10 results by default and allows at most 40 per request. To get more, add the query parameter maxResults=40, where the value can be any number up to 40. You can then decide either to filter out all results without imageLinks programmatically, or to keep them and give them a "no image" URL. Also, not every result includes a list of authors; this example handles that as well. Take no risks with third-party APIs: always check for empty/null values, because they can break your code. I have used .get to avoid exceptions while processing the JSON.

Although I have not added it in this example, you can also use the pagination that Google Books provides to fetch even more results.

Example:

@app.route('/search', methods=["GET", "POST"])
@login_required
def search():
    if request.method == "POST":
        seek = request.form.get("seek")
        url = f'https://www.googleapis.com/books/v1/volumes?q={seek}&maxResults=40'
        response = requests.get(url)
        response.raise_for_status()
        results = response.json().get('items', [])
        infobooks = []
        no_image = {'smallThumbnail': 'http://no-image-link/image-small.jpeg', 'thumbnail': 'http://no-image-link/image.jpeg'}
        for result in results:
            info = result.get('volumeInfo', {})
            imageLinks = info.get("imageLinks")
            infobooks.append({
                "book_id": result.get('id'),
                "thumbnail": imageLinks if imageLinks else no_image,
                "title": info.get('title'),
                "authors": info.get('authors')
            })
        return render_template("index.html", infobooks=infobooks)
    else:
        return render_template("index.html")

Google Books Api Docs: https://developers.google.com/books/docs/v1/using
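
The pagination mentioned above could look roughly like the following. This is a minimal sketch, not part of the original answer: it assumes the `startIndex` and `maxResults` query parameters of the volumes endpoint, and the helper names `page_params` and `page_url` are made up for illustration. Building the query string with `urlencode` also takes care of escaping spaces in the search term.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.googleapis.com/books/v1/volumes"
MAX_PAGE_SIZE = 40  # upper limit for maxResults stated in the answer


def page_params(query, page, page_size=MAX_PAGE_SIZE):
    """Build the query parameters for one zero-based page of search results."""
    if not 1 <= page_size <= MAX_PAGE_SIZE:
        raise ValueError(f"page_size must be between 1 and {MAX_PAGE_SIZE}")
    return {
        "q": query,
        "startIndex": page * page_size,  # offset of the first result on this page
        "maxResults": page_size,
    }


def page_url(query, page, page_size=MAX_PAGE_SIZE):
    """Return the full, URL-encoded request URL for one page (hypothetical helper)."""
    return f"{BASE_URL}?{urlencode(page_params(query, page, page_size))}"


# Second page (results 40-79) of a search for "harry potter"
print(page_url("harry potter", 1))
```

With `requests`, the dictionary from `page_params` can be passed directly as `requests.get(BASE_URL, params=...)` instead of formatting the URL by hand.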

ARR
  • Your solution is very good. I just needed to change the access to the thumbnails by including the line `thumbs = imageLinks.get('thumbnail')` and getting `"thumbnail": thumbs if thumbs else no_image`. I don't know if it's the cleanest way to do it, but it's working. – ARNON Jun 12 '21 at 22:56
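
The adjustment described in the comment above, together with the two options the answer mentions (substituting a default image or filtering out results without a cover), can be sketched as follows. The helper names `thumbnail_url` and `with_covers_only` are hypothetical, the placeholder URL simply mirrors the one in the answer, and the sample data is hand-written rather than real API output.

```python
# Hypothetical placeholder URL, mirroring the one used in the answer
NO_IMAGE_URL = "http://no-image-link/image.jpeg"


def thumbnail_url(volume, default=NO_IMAGE_URL):
    """Return the volume's thumbnail URL, or `default` if it has none."""
    # `or {}` also covers the case where imageLinks is missing entirely
    image_links = volume.get("volumeInfo", {}).get("imageLinks") or {}
    return image_links.get("thumbnail") or default


def with_covers_only(volumes):
    """Keep only the volumes that actually carry a thumbnail."""
    return [v for v in volumes if thumbnail_url(v, default=None)]


# Hand-written sample shaped like the items of a search response
sample = [
    {"id": "a", "volumeInfo": {"title": "With cover",
                               "imageLinks": {"thumbnail": "http://example.com/a.jpg"}}},
    {"id": "b", "volumeInfo": {"title": "Without cover"}},
]

print([thumbnail_url(v) for v in sample])           # one real URL, one placeholder
print([v["id"] for v in with_covers_only(sample)])  # only the volume with a cover
```

Compared with calling `imageLinks.get('thumbnail')` directly, this version does not raise when a result has no `imageLinks` key at all.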

From your question it was not immediately obvious what the problem is (hence the lack of engagement). After playing around with the code and the API for a bit, I now have a much better understanding of the issue.

The issue is that the Google Books API does not always include an image thumbnail for each of the items.

Your current solution is to retry the entire search until every result has an image thumbnail. Consider whether that is really needed; maybe you can split it up. In my testing, the books that come back without a thumbnail often change from one request to the next, so retrying until all results of a single query have a thumbnail will take a long time.

Instead, the solution should query each book individually for its thumbnail. After a fixed number of tries it should fall back to a default 'no image available' picture, to avoid spamming the API.

As you already figured out in your post, you can get the volume ID of each book from the original search query. You can then use this API call to query each of those volumes individually.

I've written some code to validate that this works; in my run, only one book was left without an image thumbnail at the end. This code still has a lot of room for improvement, but I'll leave that as an exercise for you.

import requests

# Max attempts to get an image
_MAX_ATTEMPTS = 5

# No Image Picture
no_img_link = 'https://upload.wikimedia.org/wikipedia/en/6/60/No_Picture.jpg'


def search_book(seek):
    url = f'https://www.googleapis.com/books/v1/volumes?q={seek}'
    response = requests.get(url)
    search = response.json()
    volumes = search['items']

    # Get ID's of all the volumes
    volume_ids = [volume['id'] for volume in volumes]

    # Storage for the results
    book_info_collection = []

    # Loop over all the volume ids
    for volume_id in volume_ids:

        # Attempt to get the thumbnail a couple times
        for i in range(_MAX_ATTEMPTS):
            url = f'https://www.googleapis.com/books/v1/volumes/{volume_id}'
            response = requests.get(url)
            volume = response.json()
            try:
                thumbnail = volume['volumeInfo']['imageLinks']['thumbnail']
            except KeyError:
                print(f'Failed for {volume_id}')
                if i < _MAX_ATTEMPTS - 1:
                    # We still have attempts left, keep going
                    continue
                # Failed on the last attempt, use a default image
                thumbnail = no_img_link
                print('Using Default')

            # Create dict with book info
            book_info = {
                "book_id": volume_id,
                "thumbnail": thumbnail,
                "title": volume['volumeInfo']['title'],
                "authors": volume['volumeInfo'].get('authors', [])  # not every volume lists authors
            }

            # Add to collection
            book_info_collection.append(book_info)
            break

    return book_info_collection


books = search_book('Harry Potter')
print(books)
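
The per-volume lookups above run one after another, so their latencies add up. One possible way to shorten the total wait, which is not part of the original code, is to run the lookups in parallel threads with the standard library's `concurrent.futures.ThreadPoolExecutor`. The sketch below assumes the lookups are independent of each other; `fetch_thumbnail` is a hypothetical callable standing in for the per-volume request (including any retries), and `fake_fetch` replaces the network call so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor

NO_IMG_LINK = "https://upload.wikimedia.org/wikipedia/en/6/60/No_Picture.jpg"


def thumbnails_concurrently(volume_ids, fetch_thumbnail, max_workers=10):
    """Look up the thumbnails for all volume ids in parallel threads.

    `fetch_thumbnail` is a hypothetical callable that takes a volume id and
    returns a thumbnail URL or None; in practice it would wrap the
    per-volume request shown above.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as volume_ids
        found = list(pool.map(fetch_thumbnail, volume_ids))
    # Fall back to the default picture where no thumbnail was found
    return {vid: url or NO_IMG_LINK for vid, url in zip(volume_ids, found)}


def fake_fetch(volume_id):
    """Offline stand-in for the network call (assumption, for illustration only)."""
    return None if volume_id == "no-cover" else f"http://example.com/{volume_id}.jpg"


print(thumbnails_concurrently(["abc", "no-cover"], fake_fetch))
```

Whether this brings the total time down far enough depends on the API's latency and rate limits, which are not measured here.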

bobveringa
  • I agree that it is a solution to try 5 attempts before assigning a default image. But this search for Harry Potter, with the code you wrote, took about 20 seconds to generate the final result. This is still not the solution I'm looking for. – ARNON Jun 08 '21 at 01:53
  • In that case, you should clarify your question. It is much too ambiguous. State what your desired result is, as in timing constraints or otherwise regarding retry or direction of general solution. Your question is how the time can be reduced, for which an answer is provided. If that does not answer your question, then you must provide more details as to WHY it does not answer the question. Simply stating "That is not what I am looking for" does not help in finding a solution. Does the solution have to be in python? Or can it be done on the web page, etc. – bobveringa Jun 08 '21 at 08:31
  • I agree with you. I'm looking for a solution primarily in Python. Of course if that's not possible, I'll have to try something in another language. – ARNON Jun 08 '21 at 12:09

You have added that you want it to load fast. This means that you cannot do the retries in Python, as any retry done on the server makes the page take longer to load.

This means that you have to do the loading in the browser. You can use the same approach as in the pure Python version: first render all the images that came back with the search request, then make additional requests for the volumes that did not have an image. This requires two endpoints: one that returns the information for all volumes of a search, and another that returns the data for a single volume.

Note that I am using the term volume instead of book, as that is what the Google API also uses.

Now, JavaScript is not my strong suit so the solution I am providing here has a lot of room for improvement.

I've made this example using Flask. It should help you implement a solution that fits your specific application.

Extra note: in my testing I've noticed that some regions respond with all thumbnails more often than others. The API sends different responses based on your IP address. If I set my IP to be in the US, I often get all the thumbnails without retries. I am using a VPN to do this, but there may be other solutions.

app.py

import time

from flask import Flask, render_template, request, jsonify
import requests

app = Flask(__name__)


@app.route('/')
def landing():
    return render_template('index.html', volumes=get_volumes('Harry Potter'))


@app.route('/get_volume_info')
def get_volume_info_endpoint():
    volume_id = request.args.get('volume_id')
    if volume_id is None:
        # Return an error if no volume id was provided
        return jsonify({'error': 'must provide argument'}), 400

    # To stop spamming the API
    time.sleep(0.250)
    
    # Request volume data
    url = f'https://www.googleapis.com/books/v1/volumes/{volume_id}'
    response = requests.get(url)
    volume = response.json()

    # Get the info using the helper function
    volume_info = get_volume_info(volume, volume_id)
    
    # Return json object with the info
    return jsonify(volume_info), 200


def get_volumes(search):
    # Make request
    url = f'https://www.googleapis.com/books/v1/volumes?q={search}'
    response = requests.get(url)
    data = response.json()

    # Get the volumes
    volumes = data['items']

    # Add list to store results
    volume_info_collection = []

    # Loop over the volumes
    for volume in volumes:
        volume_id = volume['id']
        
        # Get volume info using helper function
        volume_info = get_volume_info(volume, volume_id)

        # Add it to the result
        volume_info_collection.append(volume_info)
    
    return volume_info_collection


def get_volume_info(volume, volume_id):
    # Get basic information
    volume_title = volume['volumeInfo']['title']
    volume_authors = volume['volumeInfo'].get('authors', [])  # not every volume lists authors
    
    # Set default value for thumbnail
    volume_thumbnail = None
    try:
        volume_thumbnail = volume['volumeInfo']['imageLinks']['thumbnail']
    except KeyError:
        # Failed we keep the None value
        print('Failed to get thumbnail')
    
    # Fill in the dict
    volume_info = {
        'volume_id': volume_id,
        'volume_title': volume_title,
        'volume_authors': volume_authors,
        'volume_thumbnail': volume_thumbnail
    }
    
    # Return volume info
    return volume_info


if __name__ == '__main__':
    app.run()

Template index.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
    <script>
        let tracker = {}

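        // Ask the Flask endpoint for one volume.
        // Returns a promise that resolves to the thumbnail URL, or null on failure.
        function get_thumbnail(id) {
            let url = '/get_volume_info?volume_id=' + encodeURIComponent(id)
            return fetch(url).then(function (response) {
                return response.json();
            }).then(function (data) {
                console.log(data);
                return data['volume_thumbnail']
            }).catch(function () {
                console.log("Error");
                return null
            });
        }

        function image_load_failed(id) {
            let element = document.getElementById(id)

            if (isNaN(tracker[id])) {
                tracker[id] = 0
            }
            console.log(tracker[id])

            if (tracker[id] >= 5) {
                // Give up: stop retrying and show the placeholder
                element.onerror = null
                element.src = 'https://via.placeholder.com/128x196C/O%20https://placeholder.com/'
                return
            }

            tracker[id]++
            // fetch is asynchronous, so set the src only once the promise resolves
            get_thumbnail(id).then(function (thumbnail) {
                if (thumbnail) {
                    element.src = thumbnail
                } else {
                    // Still no thumbnail: count another attempt
                    image_load_failed(id)
                }
            })
        }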
    </script>
</head>
<body>

<table>
    <tr>
        <th>ID</th>
        <th>Title</th>
        <th>Authors</th>
        <th>Thumbnail</th>
    </tr>
    {% for volume in volumes %}
        <tr>
            <td>{{ volume['volume_id'] }}</td>
            <td>{{ volume['volume_title'] }}</td>
            <td>{{ volume['volume_authors'] }}</td>
            <td><img id="{{ volume['volume_id'] }}" src="{{ volume['volume_thumbnail'] }}"
                     onerror="image_load_failed('{{ volume['volume_id'] }}')"></td>
        </tr>
    {% endfor %}

</table>

</body>
</html>
bobveringa

Add a dummy image URL as a fallback when a result has no thumbnail:

"thumbnail": seek[i]['volumeInfo'].get('imageLinks', {}).get('thumbnail') or 'dummy_url'
Henshal B