
I have ~40,000 PyTorch .pt files saved independently on Google Drive. I would like to read each one and append the representation from a given layer (a 1280-length tensor) to a list in the fastest possible way.

I tried reading the files one by one in a for loop with torch.load(), then appending the required layer to a list with mylist.append(mymodel['mean_representations'][MY_LAYER]). The first ~3,000 files are loaded and appended within a second, then the loop chokes (presumably because the list becomes too large?) and appends only 1 file per second. At that rate I'd have to wait ~8 hours to read all the files.
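For reference, a minimal, self-contained sketch of the loop described above. The 'mean_representations' key and MY_LAYER come from the question; the layer index, file names, and dummy data are assumptions so the snippet can run anywhere:

```python
import glob
import os
import tempfile

import torch

MY_LAYER = 33  # hypothetical layer index

# Create a few dummy .pt files standing in for the ~40,000 on Drive.
tmpdir = tempfile.mkdtemp()
for i in range(5):
    torch.save(
        {"mean_representations": {MY_LAYER: torch.randn(1280)}},
        os.path.join(tmpdir, f"protein_{i}.pt"),
    )

# The loop from the question: load each file, pull out the layer tensor.
reps = []
for path in glob.glob(os.path.join(tmpdir, "*.pt")):
    data = torch.load(path, map_location="cpu")
    reps.append(data["mean_representations"][MY_LAYER])

# Stacking once at the end gives a single (N, 1280) tensor.
all_reps = torch.stack(reps)
print(all_reps.shape)  # torch.Size([5, 1280])
```

Appending to a Python list is cheap (amortized O(1)), so the slowdown is unlikely to be the list itself.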

Does anyone have a suggestion for how to do this in a smarter way? I'm running the code on Colab.

I tried disabling the garbage collector, as suggested in some other posts. This helps increase the number of files appended within the first few seconds, but the loop then also drops to the 1-file-per-second rate.

Andrija
  • How large is each file? Have you checked memory and swap usage as the program runs? – Carcigenicate Feb 11 '23 at 22:44
  • Thanks for the suggestion! What would be your suggested way of doing this? I followed the instructions from this link (https://www.geeksforgeeks.org/how-to-get-current-cpu-and-ram-usage-in-python/) to monitor RAM and CPU usage. Indeed, CPU usage increases to 55% (and keeps increasing) once the loop slows down. – Andrija Feb 11 '23 at 23:09
  • I would just use Task Manager/`htop`. That would be the easiest way by far unless you wanted to log utilization or something. But again, how big are the files? You can just calculate if you'll have memory problems. – Carcigenicate Feb 11 '23 at 23:15
  • This post explains how to monitor swapping with psutil: https://stackoverflow.com/questions/37760854/detect-swapping-in-python Interestingly, when I re-ran the loop with the swapping monitor, it appended everything within seconds (as it should, because all the .pt files are 186 MB in total). I guess it could have been some connection issue between Colab & Google Drive? I must have run this block of code a dozen times with the same outcome. Anyway, thanks all for your helpful feedback - I appreciate it! – Andrija Feb 11 '23 at 23:27
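Given the last comment, the bottleneck was most likely per-file latency between Colab and the mounted Drive, not memory. A common workaround is to copy the whole folder to Colab's local disk once and read from there. A minimal sketch, with the "Drive" source simulated by a temporary directory so it runs anywhere (on Colab the source would be a mounted path such as /content/drive/MyDrive/..., which is an assumption here):

```python
import os
import shutil
import tempfile

# Simulated "Drive" folder containing a few placeholder .pt files.
src = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(src, f"protein_{i}.pt"), "wb") as f:
        f.write(b"\x00" * 16)

# One bulk copy to fast local storage; thousands of small reads then hit
# the local disk instead of going through the Drive mount each time.
dst = os.path.join(tempfile.mkdtemp(), "local_pt")
shutil.copytree(src, dst)

local_files = sorted(os.listdir(dst))
print(len(local_files))  # 3
```

Zipping the folder on Drive and extracting the single archive locally is usually even faster than copying 40,000 individual files.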

0 Answers