Where is the bottleneck in my image manipulation code?

Question

I wrote this script to do some image processing on a large number of PNG files (around 1500 in total). They are organized into subdirectories.

Here is my code:

from PIL import Image
import os

path = "/Some/given/path"

file_list = []
counter = 1

for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith(".png"):
            temp_file = {"path": os.path.join(root, file), "name": file}
            file_list.append(temp_file)

for curr_file in file_list:
    img = Image.open(curr_file["path"])
    img = img.convert("RGBA")
    val = list(img.getdata())
    new_data = []
    for item in val:
        if item[3] == 0:
            new_data.append(item)
        else:
            new_data.append((0, 0, 0, 255))
        img.putdata(new_data)
    file_name = "transform" + str(counter) + ".png"
    replaced_text = curr_file["name"]
    new_file_name = curr_file["path"].replace(replaced_text, file_name)
    img.save(new_file_name)
    counter += 1

The folder structure is as follows:

Source folder
     -- folder__1
        -- image_1.png
        -- image_2.png
        -- image_3.png
     -- folder__2
        -- image_3.png
        -- image_5.png
     -- folder__3
        -- image_6.png

When testing on individual images, the image processing takes only a few seconds. However, when running the script, it takes around an hour to process 15 images. Any suggestions on where I'm messing up?

If you want to know where the bottleneck is, the first thing you should do is [use the profiler](https://docs.python.org/3/library/profile.html). — Karl Knechtel, Apr 13 '21 at 17:06
Use snakeviz - https://jiffyclub.github.io/snakeviz/ to generate cprofile and then visualise it. — Nk03, Apr 13 '21 at 17:09
That said, to get performance with this kind of image manipulation you certainly want to [get Numpy data](https://stackoverflow.com/questions/384759/how-to-convert-a-pil-image-into-a-numpy-array) and then do Numpy things. — Karl Knechtel, Apr 13 '21 at 17:10
You really shouldn't be using, or even thinking of using `for` loops or lists with images in Python. Use `Numpy` like Hans shows, and then, if you have thousands of images, use `multiprocessing` in these days of multi-core CPUs. — Mark Setchell, Apr 13 '21 at 20:18

HansHirse · Accepted Answer · 2021-04-13T19:54:23.737

The main issue is located here:

new_data = []
for item in val:
    if item[3] == 0:
        new_data.append(item)
    else:
        new_data.append((0, 0, 0, 255))
    img.putdata(new_data)                   # <--

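Because of its indentation, img.putdata(new_data) runs once per pixel, each time writing everything collected so far, so the work grows quadratically with the number of pixels. You don't need to update the content of img for each pixel if you're collecting the complete new_data anyway. So, just move that line outside the loop: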

new_data = []
for item in val:
    if item[3] == 0:
        new_data.append(item)
    else:
        new_data.append((0, 0, 0, 255))
img.putdata(new_data)                       # <--
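
To check the corrected loop in isolation, here is a minimal sketch on a tiny synthetic image. The helper name blacken_visible_pixels and the two test pixels are made up for illustration; they are not part of the original script. With N pixels, this version calls putdata once instead of N times.

```python
from PIL import Image


def blacken_visible_pixels(img):
    """Return an RGBA copy in which every pixel with non-zero alpha is opaque black."""
    img = img.convert("RGBA")
    new_data = [
        item if item[3] == 0 else (0, 0, 0, 255)
        for item in img.getdata()
    ]
    img.putdata(new_data)  # one call per image, not one call per pixel
    return img


# Hypothetical 2x1 test image: one fully transparent pixel, one visible pixel.
demo = Image.new("RGBA", (2, 1))
demo.putdata([(10, 20, 30, 0), (200, 100, 50, 128)])

result = list(blacken_visible_pixels(demo).getdata())
print(result)
```

The transparent pixel should come back unchanged, and the visible one should become (0, 0, 0, 255).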

Next, avoid iterating over the pixels altogether by using NumPy and its vectorization capabilities:

from PIL import Image
import os
import numpy as np                          # <--

path = "/Some/given/path"

file_list = []
counter = 1

for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith(".png"):
            temp_file = {"path": os.path.join(root, file), "name": file}
            file_list.append(temp_file)

for curr_file in file_list:
    img = Image.open(curr_file["path"])
    img = img.convert("RGBA")
    img = np.array(img)                     # <--
    img[img[..., 3] != 0] = (0, 0, 0, 255)  # <--
    img = Image.fromarray(img)              # <--
    file_name = "transform" + str(counter) + ".png"
    replaced_text = curr_file["name"]
    new_file_name = curr_file["path"].replace(replaced_text, file_name)
    img.save(new_file_name)
    counter += 1

Basically, you set all pixels whose alpha channel is not equal to 0 to (0, 0, 0, 255). That's the NumPy one-liner you see there. The lines before and after it only convert the Pillow Image to a NumPy array and back.
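
To see what the boolean mask does, here is a small worked example on a hand-written 2x2 RGBA array. The pixel values are arbitrary and chosen only for illustration. It splits the one-liner into two steps so the intermediate mask can be inspected.

```python
import numpy as np

# Hypothetical 2x2 RGBA image, shape (height, width, 4).
pixels = np.array(
    [[[10, 20, 30, 0], [200, 100, 50, 128]],
     [[0, 0, 0, 255], [5, 5, 5, 0]]],
    dtype=np.uint8,
)

# pixels[..., 3] selects the alpha channel, shape (2, 2).
mask = pixels[..., 3] != 0       # True where the pixel is not fully transparent

# Boolean indexing selects whole pixels; the 4-tuple is broadcast to each of them.
pixels[mask] = (0, 0, 0, 255)

print(mask)
print(pixels)
```

Only the pixels where the mask is True are overwritten; fully transparent pixels keep their original colour values.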

EDIT: If you don't want to have NumPy in your code, you could also get rid of the loops by using Pillow's point function, cf. this tutorial:

from PIL import Image
import os

path = "/Some/given/path"

file_list = []
counter = 1

for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith(".png"):
            temp_file = {"path": os.path.join(root, file), "name": file}
            file_list.append(temp_file)

for curr_file in file_list:
    img = Image.open(curr_file["path"])
    img = img.convert("RGBA")
    source = img.split()                                                # <--
    mask = source[3].point(lambda i: i > 0 and 255)                     # <--
    img.paste(Image.new("RGBA", img.size, (0, 0, 0, 255)), None, mask)  # <--
    file_name = "transform" + str(counter) + ".png"
    replaced_text = curr_file["name"]
    new_file_name = curr_file["path"].replace(replaced_text, file_name)
    img.save(new_file_name)
    counter += 1
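
As a quick sanity check of the point/paste approach, here is a sketch on a synthetic 2x1 image. The test pixels are invented for illustration, and the lambda is written as an explicit conditional, which should produce the same lookup table as the shorter form above.

```python
from PIL import Image

# Hypothetical 2x1 test image: one fully transparent pixel, one visible pixel.
img = Image.new("RGBA", (2, 1))
img.putdata([(10, 20, 30, 0), (200, 100, 50, 128)])

alpha = img.split()[3]                              # alpha band as an "L" image
mask = alpha.point(lambda i: 255 if i > 0 else 0)   # 255 wherever alpha is non-zero
black = Image.new("RGBA", img.size, (0, 0, 0, 255))
img.paste(black, None, mask)                        # copy black only where mask is 255

left = img.getpixel((0, 0))
right = img.getpixel((1, 0))
print(left, right)
```

Where the mask is 0, paste leaves the existing pixel alone, so the transparent pixel keeps its original values.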

----------------------------------------
System information
----------------------------------------
Platform:      Windows-10-10.0.16299-SP0
Python:        3.9.1
NumPy:         1.20.2
Pillow:        8.1.2
----------------------------------------

score 0 · Answer 2 · answered Apr 13 '21 at 17:12

You can profile your code with cProfile and visualise the result with the snakeviz library:

Snakeviz - https://jiffyclub.github.io/snakeviz/

python -m cProfile -o program.prof my_program.py

Once the profile is generated, you can visualise it and see which functions take the most time:

snakeviz program.prof
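
If installing snakeviz is not an option, the standard library's pstats module can print a text summary instead. The following is a minimal sketch that profiles a made-up function (slow_sum is hypothetical and stands in for the image loop) from inside Python rather than from the command line.

```python
import cProfile
import pstats


def slow_sum(n):
    """Deliberately loop-heavy placeholder for the code under investigation."""
    total = 0
    for i in range(n):
        total += i * i
    return total


profiler = cProfile.Profile()
profiler.enable()
value = slow_sum(100_000)
profiler.disable()

# Sort by cumulative time and show the five most expensive entries.
stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(5)
```

A profile file written by the cProfile command above can be loaded the same way by passing its file name, e.g. pstats.Stats("program.prof").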
