Get files pictures with threads and queue in a particular website

Question

I'm trying to write a simple Python 3 program that uses threads and a queue to download images concurrently from URL links: 4 or more threads downloading 4 images at the same time into the Downloads folder on the PC, while avoiding duplicates by sharing information between the threads. I suppose I could use something like URL1 = "Link1"? Here are some example links.

https://unab-dw2018.s3.amazonaws.com/ldp2019/1.jpeg

https://unab-dw2018.s3.amazonaws.com/ldp2019/2.jpeg

But I don't understand how to use threads together with a queue, and I'm lost on how to do this.

I have searched for a page that explains how to use threads with a queue for concurrent downloads, but I have only found material that covers threads alone.

Here is code that partially works. What I need is for the program to ask how many threads to use and then download images until it reaches image 20. With the current code, if I enter 5, it only downloads 5 images, and so on. What I want is this: if I enter 5, it downloads 5 images first, then the next 5, and so on until 20. If I enter 4, it goes 4, 4, 4, 4, 4. If I enter 6, it goes 6, 6, 6 and then downloads the remaining 2. Somehow I must use a queue in the code, but I only learned about threads a few days ago and I'm lost on how to combine threads and a queue.

import threading
import urllib.request
import queue # i need to use this somehow


def worker(cont):
    print("The worker is ON",cont)
    image_download = "URL"+str(cont)+".jpeg"
    download = urllib.request.urlopen(image_download)
    file_save = open("Image "+str(cont)+".jpeg", "wb")
    file_save.write(download.read())
    file_save.close()
    return cont+1


threads = []
q_threads = int(input("Choose input amount of threads between 4 and 20"))
for i in range(0, q_threads):
    h = threading.Thread(target=worker, args=(i+1,))  # worker takes a single argument
    threads.append(h)
for i in range(0, q_threads):
    threads[i].start()

I provided a cut down solution I used for a different project with renamed variables — alexanderhurst, Jun 08 '19 at 04:39
I added a code attempt to solve my problem, but I still have some issues with it, such as how to implement a queue — Treyon Daren, Jun 10 '19 at 00:29
For what you are doing in this code snippet you probably don't even need a queue. I did include a queue in my solution, and you can modify it to return whatever you want from the threads — alexanderhurst, Jun 10 '19 at 14:19

alexanderhurst · Accepted Answer · 2019-06-10T14:08:57.577

I adapted the following from some code I used to perform multithreaded PSO:

import threading
import queue

if __name__ == "__main__":
    picture_queue = queue.Queue(maxsize=0)
    picture_threads = []
    picture_urls = ["string.com","string2.com"]

    # create and start the threads
    for url in picture_urls:
        picture_threads.append(picture_getter(url, picture_queue))
        # start the thread that was just appended
        picture_threads[-1].start()

    # wait for threads to finish
    for picture_thread in picture_threads:
        picture_thread.join()

    # get the results
    picture_list = []
    while not picture_queue.empty():
        picture_list.append(picture_queue.get())

class picture_getter(threading.Thread):
    def __init__(self, url, picture_queue):
        self.url = url
        self.picture_queue = picture_queue
        super(picture_getter, self).__init__()

    def run(self):
        print("Starting download on " + str(self.url))
        self._get_picture()

    def _get_picture(self):
        # --- get your picture --- #
        self.picture_queue.put(picture)

Just so you know, people on Stack Overflow like to see what you have tried first before providing a solution. However, I had this code lying around anyway. Welcome aboard, fellow newbie!

One thing I will add is that this does not avoid duplication by sharing information between threads. It avoids duplication because each thread is told exactly what to download. If your filenames are numbered, as they appear to be in your question, this shouldn't be a problem, as you can easily build a list of them.
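
For comparison, the "share the work through a queue" idea from the question could look roughly like the sketch below. This is not the code from this answer: `fetch_url`, `download_all` and the `fetch` parameter are names made up for illustration. Every URL is put on a `queue.Queue` exactly once, and each worker thread keeps taking the next URL until the queue is empty, so no image is fetched twice and at most `num_threads` downloads run at the same time.

```python
import queue
import threading
import urllib.request


def fetch_url(url):
    """Return the raw bytes behind url (performs a real HTTP request)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def download_all(urls, num_threads, fetch=fetch_url):
    """Fetch every URL exactly once using num_threads worker threads.

    Returns a dict mapping each URL to its bytes, or to the exception
    that was raised while fetching it.
    """
    # Fill the queue before any worker starts, so an empty queue
    # reliably means "no work left".
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # nothing left to download
            try:
                data = fetch(url)
            except Exception as exc:  # keep going if one download fails
                data = exc
            with results_lock:
                results[url] = data

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

With the links from the question, a call could be `download_all(["https://unab-dw2018.s3.amazonaws.com/ldp2019/{}.jpeg".format(n) for n in range(1, 21)], 4)`. The caller would then write each returned value to a file. Unlike starting threads in fixed groups, a worker here picks up the next image as soon as it finishes its current one.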

Updated code to address the edits to Treyon's original post:

import threading
import urllib.request
import queue
import time

class picture_getter(threading.Thread):
    def __init__(self, url, file_name, picture_queue):
        self.url = url
        self.file_name = file_name
        self.picture_queue = picture_queue

        super(picture_getter, self).__init__()

    def run(self):
        print("Starting download on " + str(self.url))
        self._get_picture()

    def _get_picture(self):
        print("{}: Simulating delay".format(self.file_name))
        time.sleep(1)

        # download and save the image; the with blocks close the
        # connection and the file even if an error occurs
        with urllib.request.urlopen(self.url) as download:
            with open("Image " + self.file_name, "wb") as file_save:
                file_save.write(download.read())
        self.picture_queue.put("Image " + self.file_name)

def remainder_or_max_threads(num_pictures, num_threads, iterations):
    # remaining pictures
    remainder = num_pictures - (num_threads * iterations)

    # if there are equal or more pictures remaining than max threads
    # return max threads, otherwise remaining number of pictures
    if remainder >= num_threads:
        # use the parameter rather than the global max_threads
        return num_threads

    else:
        return remainder

if __name__ == "__main__":
    # store the response from the threads
    picture_queue = queue.Queue(maxsize=0)
    picture_threads = []
    num_pictures = 20

    url_prefix = "https://unab-dw2018.s3.amazonaws.com/ldp2019/"
    picture_names = ["{}.jpeg".format(i+1) for i in range(num_pictures)]

    max_threads = int(input("Choose input amount of threads between 4 and 20: "))

    iterations = 0

    # during the majority of runtime iterations * max threads is 
    # the number of pictures that have been downloaded
    # when it exceeds num_pictures all pictures have been downloaded
    while iterations * max_threads < num_pictures:
        # this returns max_threads if there are max_threads or more pictures left to download
        # else it will return the number of remaining pictures
        threads = remainder_or_max_threads(num_pictures, max_threads, iterations)

        # loop through the next section of pictures, create and start their threads
        for name, i in zip(picture_names[iterations * max_threads:], range(threads)):
            picture_threads.append(picture_getter(url_prefix + name, name, picture_queue))
            picture_threads[i + iterations * max_threads].start()

        # wait for threads to finish
        for picture_thread in picture_threads:
            picture_thread.join()

        # increment the iterations
        iterations += 1

    # get the results
    picture_list = []
    while not picture_queue.empty():
        picture_list.append(picture_queue.get())

    print("Successfully downloaded")
    print(picture_list)
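
The batch bookkeeping above (`iterations`, `remainder_or_max_threads` and the index offset) can also be expressed with list slicing. The following is a minimal sketch of that same start-a-batch, join-it, start-the-next pattern, not a replacement for the code above; `batches` and `run_in_batches` are hypothetical helper names, and `task` stands in for whatever downloads one picture.

```python
import threading


def batches(items, size):
    """Yield consecutive slices of items, each at most size long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_batches(items, size, task):
    """Call task(item) in one thread per item, one batch at a time."""
    for batch in batches(items, size):
        threads = [threading.Thread(target=task, args=(item,))
                   for item in batch]
        for thread in threads:
            thread.start()
        # wait for the whole batch before starting the next one
        for thread in threads:
            thread.join()
```

For 20 items and a batch size of 6, `batches` yields groups of 6, 6, 6 and 2, which matches the behaviour asked for in the question. The last slice is simply shorter, so no separate remainder calculation is needed.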

You can also override the join method and make it return the picture, as is done in [this post](https://stackoverflow.com/a/6894023/10626861) — alexanderhurst, Jun 08 '19 at 04:44
Thanks for the answer, alexanderhurst. I tried to run this on Python 3, but picture_getter and picture in the class are not recognized. I'm new to Python 3 and I don't understand why the error happens. I replaced string.com with the image URL I posted. — Treyon Daren, Jun 08 '19 at 23:23
@TreyonDaren In my original code I did not do anything to retrieve a picture, which is why picture would be undefined. I also left the main block `(if __name__ == "__main__")` above the picture_getter class for readability. Python, however, needs picture_getter to be defined before it is referenced in main, so to solve that, main just needs to be below picture_getter. I have modified my code to include your image download code, fix the name definition issues, and meet your new requirement of a maximum number of threads. Let me know if you have any problems — alexanderhurst, Jun 10 '19 at 14:15
Many thanks for the code. I have learned a lot from it, but I'm confused about what exactly is happening in this part: `for name, i in zip(picture_names[iterations*max_threads:], range(threads)): picture_threads.append(picture_getter(url_prefix + name, name, picture_queue)) picture_threads[i + iterations * max_threads].start()` Could you explain why you wrote it like this and what exactly it does? — Treyon Daren, Jun 14 '19 at 17:15
No problem. The line `for name, i in zip(picture_names[iterations * max_threads:], range(threads)):` is a for loop over each element of a zip. A zip pairs up two iterables, such as lists, so that you can step through them at the same time. So if I had two lists, a = [1, 2, 3] and b = ['a', 'b', 'c'], it would loop through and pass back the values in this order: 1, 'a' then 2, 'b' then 3, 'c'. Here each pair gets assigned to name, i, so that I have the element I want and an index. It is very similar to using enumerate; in fact, looking back now, I should have used enumerate. Cont. next comment — alexanderhurst, Jun 15 '19 at 18:39
`picture_threads.append(picture_getter(url_prefix + name, name, picture_queue))` just adds the thread, with its arguments, to the list of threads — alexanderhurst, Jun 15 '19 at 18:44
`picture_threads[i + iterations * max_threads].start()` starts the thread after it is created. i comes from the zip and is incremented to follow each item added to the list. We add iterations * max_threads so that we have an offset when we loop back to the beginning and i gets reset to 0 — alexanderhurst, Jun 15 '19 at 18:45
Hopefully that makes sense :) I have been drinking, so if it doesn't, let me know and I'll take another look when I sober up. — alexanderhurst, Jun 15 '19 at 18:48
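
To illustrate the zip versus enumerate remark in the comments, here is a small sketch that builds the same (index, name) pairs both ways. The two function names are made up for this example; the parameters mirror the variables in the answer's code, where `threads` is the number of pictures in the current batch.

```python
def pairs_with_zip(picture_names, max_threads, iterations, threads):
    """Index/name pairs computed the way the answer's loop does it."""
    offset = iterations * max_threads
    return [(i + offset, name)
            for name, i in zip(picture_names[offset:], range(threads))]


def pairs_with_enumerate(picture_names, max_threads, iterations, threads):
    """The same pairs, built with enumerate and a start offset."""
    offset = iterations * max_threads
    batch = picture_names[offset:offset + threads]
    return list(enumerate(batch, start=offset))
```

Both functions return the position of each thread in `picture_threads` together with the file name it downloads. With enumerate, the offset is passed once as `start`, so there is no need to add it to `i` by hand.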
