I have ~40,000 PyTorch .pt files saved independently on Google Drive. I would like to read each one and append the representation from a given layer (a 1280-element tensor) to a list as fast as possible.
I tried reading the files one by one in a for loop with torch.load(), then appending the required layer to a list with mylist.append(mymodel['mean_representations'][MY_LAYER]). Roughly the first ~3000 files are loaded and appended within a second, then the loop chokes (presumably because the list becomes too large?) and appends only 1 file per second. At that rate I'd have to wait ~8 hours to read all the files.
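For reference, this is roughly what my loop looks like (a minimal sketch; the directory path and the layer index are placeholders for my actual setup):

```python
import glob
import torch

MY_LAYER = 33  # placeholder: index of the layer whose representation I need

representations = []
for path in glob.glob('/content/drive/MyDrive/embeddings/*.pt'):  # placeholder path
    model_out = torch.load(path)  # load one .pt file at a time
    # keep only the 1280-element tensor for the chosen layer
    representations.append(model_out['mean_representations'][MY_LAYER])
```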
Does anyone have a suggestion for how to do this in a smarter way? I'm running the code on Colab.
I also tried disabling the garbage collector, as suggested in some other posts, which increases the number of files appended within the first few seconds, but the rate then drops to 1 file per second as well.
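Concretely, the garbage-collector variant just wraps the same loop (again a sketch with the same placeholder names):

```python
import gc
import glob
import torch

MY_LAYER = 33  # placeholder layer index

gc.disable()  # suggested workaround: turn off garbage collection during the loop
representations = []
for path in glob.glob('/content/drive/MyDrive/embeddings/*.pt'):  # placeholder path
    model_out = torch.load(path)
    representations.append(model_out['mean_representations'][MY_LAYER])
gc.enable()  # re-enable once everything is loaded
```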