I would normally not recommend storing this many text files in RAM; most of the time, this would take more memory than you have available. Instead, I would recommend restructuring your loop so that you do not have to iterate over the files multiple times.
That said, if the contents do fit into memory: since you do not say that you need to change the files, I would recommend storing them all in a dictionary with the filename as the key. If you use an OrderedDict, you can even just iterate through the contents (using .itervalues() on Python 2 or .values() on Python 3) if the filenames are not important to you either.
In this case, you could iterate over a list of file names with a for loop (build the list of filenames directly with the corresponding os functionality, e.g. os.listdir(), or provide it beforehand) and read all files into the dictionary:
import collections

d = collections.OrderedDict()
file_list = ["a", "b", "c"]  # fill in your file names here or adapt the loop accordingly
for file_path in file_list:
    with open(file_path, "r") as f:  # close each file again right after reading it
        d[file_path] = f.read()
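Once everything is cached like this, the repeated passes only touch memory and never hit the disk again. A minimal sketch of what the passes could then look like, assuming e.g. 100 passes and a hypothetical process() function standing in for your current per-file logic:

for epoch in range(100):
    for file_content in d.values():  # .itervalues() on Python 2
        process(file_content)  # hypothetical placeholder for your current logic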
Alternative way:
This is not an exactly matching solution, but an alternative that might speed you up a little:
I do not know the files you are using, but if you can still tell the input files apart, e.g. because each one only contains a single line, you could instead copy them all into one huge file and only walk through that file, e.g. with
for line in huge_cache_file:
    # your current logic here
This would not speed you up as much as keeping everything in RAM would, but it would get rid of the overhead of opening and closing 17k files a hundred times.
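Building such a cache file is a one-time job before the actual processing. A minimal sketch of that step, reusing the hypothetical file_list from the dictionary example above and a cache file name of my choosing:

with open("huge_cache_file.txt", "w") as cache:
    for file_path in file_list:  # the same list of file names as above
        with open(file_path, "r") as f:
            cache.write(f.read())  # append this file's content to the cache file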
At the end of the big cache file, you could then just jump to the beginning again using
huge_cache_file.seek(0)
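Put together, the repeated passes over the cache file could then look roughly like this, again assuming e.g. 100 passes and a hypothetical process() placeholder for your logic:

with open("huge_cache_file.txt", "r") as huge_cache_file:
    for epoch in range(100):
        huge_cache_file.seek(0)  # rewind to the start of the cache file for every pass
        for line in huge_cache_file:
            process(line)  # hypothetical placeholder for your current logic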
If newlines are not an option but your files all have the same fixed length, you could still copy them together and iterate like this:
for file_content in iter(lambda: huge_cache_file.read(file_length), ""):
    # your current logic here
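The empty string works as the sentinel for iter() here because read() returns "" once the end of the file is reached; if you opened the cache file in binary mode, the sentinel would have to be b"" instead.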
If the files have different lengths, you could still do this, but you would have to store the length of each individual file in a list and use those stored lengths to read from the cache file:
file_lengths = [1024, 234, 16798704]  # all file lengths in sequence here
for epoch in range(100):
    huge_cache_file.seek(0)
    for file_length in file_lengths:
        file_content = huge_cache_file.read(file_length)
        # your current logic here
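One way to get those lengths is to record them while building the cache file in the first place. A minimal sketch, again reusing the hypothetical file_list from above; the recorded values are character counts, which is what read() expects for a file opened in text mode:

file_lengths = []
with open("huge_cache_file.txt", "w") as cache:
    for file_path in file_list:  # the same list of file names as above
        with open(file_path, "r") as f:
            content = f.read()
        cache.write(content)
        file_lengths.append(len(content))  # number of characters written for this file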