
I wrote this script to do some image processing on a large number of PNG files (around 1500 in total). They are organized into subdirectories.

Here is my code:

from PIL import Image
import os

path = "/Some/given/path"

file_list = []
counter = 1

for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith(".png"):
            temp_file = {"path": os.path.join(root, file), "name": file}
            file_list.append(temp_file)

for curr_file in file_list:
    img = Image.open(curr_file["path"])
    img = img.convert("RGBA")
    val = list(img.getdata())
    new_data = []
    for item in val:
        if item[3] == 0:
            new_data.append(item)
        else:
            new_data.append((0, 0, 0, 255))
        img.putdata(new_data)
    file_name = "transform" + str(counter) + ".png"
    replaced_text = curr_file["name"]
    new_file_name = curr_file["path"].replace(replaced_text, file_name)
    img.save(new_file_name)
    counter += 1

The folder structure is as follows:

Source folder
     -- folder__1
        -- image_1.png
        -- image_2.png
        -- image_3.png
     -- folder__2
        -- image_3.png
        -- image_5.png
     -- folder__3
        -- image_6.png

When testing on individual images, the image processing takes only a few seconds. However, when running the script, it takes around an hour to process 15 images. Any suggestions on where I'm messing up?

HansHirse
  • If you want to know where the bottleneck is, the first thing you should do is [use the profiler](https://docs.python.org/3/library/profile.html). – Karl Knechtel Apr 13 '21 at 17:06
  • Use snakeviz - https://jiffyclub.github.io/snakeviz/ to generate cprofile and then visualise it. – Nk03 Apr 13 '21 at 17:09
  • That said, to get performance with this kind of image manipulation you certainly want to [get Numpy data](https://stackoverflow.com/questions/384759/how-to-convert-a-pil-image-into-a-numpy-array) and then do Numpy things. – Karl Knechtel Apr 13 '21 at 17:10
  • You really shouldn't be using, or even thinking of using `for` loops or lists with images in Python. Use `Numpy` like Hans shows, and then, if you have thousands of images, use `multiprocessing` in these days of multi-core CPUs. – Mark Setchell Apr 13 '21 at 20:18

2 Answers


The main issue is located here:

new_data = []
for item in val:
    if item[3] == 0:
        new_data.append(item)
    else:
        new_data.append((0, 0, 0, 255))
    img.putdata(new_data)                   # <--

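Calling img.putdata(new_data) inside the loop rewrites the image once per pixel, each time with a longer list, so the total work grows quadratically with the number of pixels. Since you're collecting the complete new_data anyway, you only need to update img once. So, just move that line outside the loop: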

new_data = []
for item in val:
    if item[3] == 0:
        new_data.append(item)
    else:
        new_data.append((0, 0, 0, 255))
img.putdata(new_data)                       # <--
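
To get a feel for the difference, here is a rough sketch that only counts work. Note that `fake_putdata` and `count_copied_items` are hypothetical helpers made up for illustration: the stand-in merely records how many items a putdata-like call would have to copy, so it is not Pillow code and the real per-call cost may differ.

```python
def count_copied_items(num_pixels, putdata_inside_loop):
    """Count the items a putdata-like call would copy in total."""
    copied = 0

    def fake_putdata(data):
        # Hypothetical stand-in for img.putdata: it only counts items.
        nonlocal copied
        copied += len(data)

    new_data = []
    for _ in range(num_pixels):
        new_data.append((0, 0, 0, 255))
        if putdata_inside_loop:
            fake_putdata(new_data)  # called once per pixel
    if not putdata_inside_loop:
        fake_putdata(new_data)      # called once per image
    return copied


inside = count_copied_items(1000, putdata_inside_loop=True)
outside = count_copied_items(1000, putdata_inside_loop=False)
print(inside, outside)  # 500500 1000
```

Even for a tiny 1000-pixel image that is 500500 copied items versus 1000, and the gap widens as images get larger, which is consistent with the slowdown described in the question.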

Better still, avoid iterating over the pixels altogether by using NumPy and its vectorization capabilities:

from PIL import Image
import os
import numpy as np                          # <--

path = "/Some/given/path"

file_list = []
counter = 1

for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith(".png"):
            temp_file = {"path": os.path.join(root, file), "name": file}
            file_list.append(temp_file)

for curr_file in file_list:
    img = Image.open(curr_file["path"])
    img = img.convert("RGBA")
    img = np.array(img)                     # <--
    img[img[..., 3] != 0] = (0, 0, 0, 255)  # <--
    img = Image.fromarray(img)              # <--
    file_name = "transform" + str(counter) + ".png"
    replaced_text = curr_file["name"]
    new_file_name = curr_file["path"].replace(replaced_text, file_name)
    img.save(new_file_name)
    counter += 1

Basically, you set all pixels whose alpha channel is not equal to 0 to (0, 0, 0, 255). That's the NumPy one-liner you see there. The lines before and after it only convert the Pillow Image to a NumPy array and back.
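
As a small illustration of how that boolean mask works, here is a sketch on a made-up 2x2 RGBA array (the pixel values are arbitrary and chosen only for this example):

```python
import numpy as np

# A made-up 2x2 RGBA image: two fully transparent pixels, two with nonzero alpha.
pixels = np.array(
    [
        [[10, 20, 30, 0], [40, 50, 60, 128]],
        [[70, 80, 90, 255], [1, 2, 3, 0]],
    ],
    dtype=np.uint8,
)

mask = pixels[..., 3] != 0      # boolean array of shape (2, 2), one entry per pixel
pixels[mask] = (0, 0, 0, 255)   # overwrite every pixel where the mask is True

print(mask.tolist())
# [[False, True], [True, False]]
print(pixels.tolist())
# [[[10, 20, 30, 0], [0, 0, 0, 255]], [[0, 0, 0, 255], [1, 2, 3, 0]]]
```

The transparent pixels keep their original values, and every pixel with nonzero alpha becomes opaque black, all without a Python-level loop.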


EDIT: If you don't want NumPy in your code, you can also get rid of the loops by using Pillow's point function to build a mask from the alpha channel and pasting a solid black image through that mask, cf. this tutorial:

from PIL import Image
import os

path = "/Some/given/path"

file_list = []
counter = 1

for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith(".png"):
            temp_file = {"path": os.path.join(root, file), "name": file}
            file_list.append(temp_file)

for curr_file in file_list:
    img = Image.open(curr_file["path"])
    img = img.convert("RGBA")
    source = img.split()                                                # <--
    mask = source[3].point(lambda i: i > 0 and 255)                     # <--
    img.paste(Image.new("RGBA", img.size, (0, 0, 0, 255)), None, mask)  # <--
    file_name = "transform" + str(counter) + ".png"
    replaced_text = curr_file["name"]
    new_file_name = curr_file["path"].replace(replaced_text, file_name)
    img.save(new_file_name)
    counter += 1
----------------------------------------
System information
----------------------------------------
Platform:      Windows-10-10.0.16299-SP0
Python:        3.9.1
NumPy:         1.20.2
Pillow:        8.1.2
----------------------------------------
HansHirse

You can profile your code with cProfile and visualise the result with the snakeviz library:

Snakeviz - https://jiffyclub.github.io/snakeviz/

python -m cProfile -o program.prof my_program.py

Once the profile is generated, you can visualise it and see which functions take the most time:

snakeviz program.prof
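
If you would rather inspect the numbers without a viewer, a minimal sketch using only the standard library's cProfile and pstats modules could look like this. The functions `slow_step` and `main` are hypothetical placeholders standing in for your own script:

```python
import cProfile
import io
import pstats


def slow_step():
    # Hypothetical placeholder for the per-image work.
    return sum(i * i for i in range(200_000))


def main():
    for _ in range(5):
        slow_step()


profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Sort by cumulative time and show the ten most expensive entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

The entries at the top of the report are the functions where most of the run time accumulates, which is usually the place to start looking.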
Nk03