
I am trying to make a list containing all the images in my dataset. There will be 50000 elements in this list.

import os

images = []
for cls in classes:
    samples_per_class = len(os.listdir(PATH_FOR_EACH_CLASS))
    for i in range(samples_per_class):
        image_path = os.path.join(
            directory,
            cls,
            str(i + 1).zfill(4) + ".png"
        )
        images.append(image_path)

But I found that setting up this big list is very slow. Is there a more efficient way to initialize a big list like this?

livemyaerodream
  • What is 'very slow'? What would you expect? There is not much happening in your code that could take time, apart from disk access. – Thierry Lathuille May 12 '20 at 17:15
  • See [How can you profile a Python script?](https://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script) This will tell you where your script is spending most of its time — which you ***may*** be able to optimize depending on where that is. – martineau May 12 '20 at 17:19
  • `os.listdir` is a call that can be quite slow on some machines, depending on your OS and hardware. How many classes are there on average, and how many samples per class? Is `PATH_FOR_EACH_CLASS` a constant or a hidden complex expression you did not want to put here? – Jérôme Richard May 12 '20 at 18:49

2 Answers


It is very likely that the inefficiency comes from disk access rather than from building the list itself. If you access the dataset from a cloud drive, every directory listing and file lookup goes over the network, which slows the program down considerably. Save your dataset on a local disk instead of a cloud drive. Hope this helps!
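As a minimal sketch (reusing the `directory` and `classes` variables from the question), you can also issue only one `os.listdir` call per class and reuse the names it returns, so slow storage is touched as little as possible:

import os

images = []
for cls in classes:
    class_dir = os.path.join(directory, cls)
    # One directory listing per class; reuse the returned names
    # instead of reconstructing them with zfill.
    for name in sorted(os.listdir(class_dir)):
        if name.endswith(".png"):
            images.append(os.path.join(class_dir, name))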

haching

You can time it yourself; see below. On my machine, the code below appended a million paths in 1.16 seconds, so building a 50,000-element list of paths is unlikely to be your bottleneck.

from timeit import default_timer as timer
import os

start = timer()
images = []
for i in range(1000000):
    # Build and append a path a million times.
    image_path = os.path.join("/home", "colinpaice")
    images.append(image_path)
end = timer()
print(end - start)
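If plain appends are that fast, the list itself is not the problem. As a rough next step, here is a sketch (assuming the `directory` and `classes` variables from the question) that times only the `os.listdir` calls, which the comments above suggest are the likely bottleneck:

import os
from timeit import default_timer as timer

start = timer()
total = 0
for cls in classes:
    # The disk (or network) access happens here, not in list.append.
    total += len(os.listdir(os.path.join(directory, cls)))
end = timer()
print(f"listed {total} entries in {end - start:.2f} seconds")

If this step dominates the runtime, the disk-access explanation in the other answer is where to look.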
colin paice