
I have a for loop that iterates over 17K text files 100 times (epochs).

Before the for loop, I want to open and read them once, caching them in RAM, so that I can access them very quickly inside the for loop.

Do you have any ideas for this scenario?

Iman Irajian
  • check StringIO: https://stackoverflow.com/q/8240647/2419215 It basically stores strings in-memory, and it provides more or less the same interface as a file (read, write) – fodma1 Nov 29 '17 at 10:05
  • You should use spark for something like this. – pissall Nov 29 '17 at 10:12
  • @pissall : For some purpose, I must use pure Python to implement this scenario. – Iman Irajian Nov 29 '17 at 10:15
  • What total data size are you talking about here? – mata Nov 29 '17 at 10:27
  • @mata : Around 2GB – Iman Irajian Nov 29 '17 at 10:28
  • How about `file.read`? It loads the file into a string... – Netwave Nov 29 '17 at 10:29
  • @DanielSanchez : I tried this solution, but it filled my RAM and made my system halt. – Iman Irajian Nov 29 '17 at 10:41
  • So it depends on how much free RAM you have. Can't you just switch the logic: read one file and process it 100 times (once for each epoch)? Doing something multiple times with data that is already in the L1/L2 cache is preferable to iterating over the whole memory each time. At around 120k per file that would sound more reasonable. – mata Nov 29 '17 at 10:44
  • @mata : I can't switch the logic. – Iman Irajian Nov 29 '17 at 10:46
  • How much RAM do you have available, by the way? – ingofreyer Nov 29 '17 at 10:46
  • @ingofreyer : Around 8GB – Iman Irajian Nov 29 '17 at 10:48
  • So physically loading everything into RAM should be possible. Your actual problem may be somewhere else. You should post the code you currently have; otherwise it will be hard to say more. – mata Nov 29 '17 at 10:50
  • @ImanIrajian I added another possible way to solve your problem into my answer. If this is not possible, try storing the file content in a list instead of the `OrderedDict`. Just use `file_content_list = []` and `file_content_list.append(open(file_path, "r").read())` then. – ingofreyer Nov 29 '17 at 11:01
  • @ingofreyer : I will check them, thank you. – Iman Irajian Nov 29 '17 at 11:03
  • @ImanIrajian I am curious: were you able to solve the problem, and how did you end up doing it? – ingofreyer Nov 30 '17 at 16:04
  • @ImanIrajian Did my answer solve your problem? If so, I would be happy if you could mark it as the correct answer for future users. Otherwise, please ask any further questions in the comments section of the answer. – ingofreyer Dec 07 '17 at 08:05
  • In Python there is a limit on the number of simultaneously open files. I solved this problem with a DataFrame from the Pandas library: before the for loop I read all the files one by one and put each file's content into one of the rows of the DataFrame. Thanks all. – Iman Irajian Dec 09 '17 at 12:31
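The approach described in the last comment might look roughly like this; a minimal sketch, assuming `file_list` holds the 17K paths (the name is only illustrative):

import pandas as pd

# Read each file exactly once, closing it immediately, and keep its text
# in one row of a DataFrame that stays in RAM for the whole run.
contents = []
for path in file_list:
    with open(path, "r") as f:
        contents.append(f.read())
cache = pd.DataFrame({"path": file_list, "text": contents})

for epoch in range(100):
    for text in cache["text"]:
        # per-file logic goes here, working on the in-memory string
        pass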

2 Answers


As the documentation says:

To read a file’s contents, call f.read(size), which reads some quantity of data and returns it as a string. size is an optional numeric argument. When size is omitted or negative, the entire contents of the file will be read and returned; it’s your problem if the file is twice as large as your machine’s memory. Otherwise, at most size bytes are read and returned. If the end of the file has been reached, f.read() will return an empty string ("").

So just use the `file.read` method.

Alternatively, you can use `mmap`.
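A minimal sketch of both suggestions, assuming `file_list` holds the 17K paths (a name introduced only for illustration):

import mmap

# Variant 1: read every file fully into a string once, before the epoch loop.
cache = {}
for path in file_list:
    with open(path, "r") as f:
        cache[path] = f.read()

# Variant 2: mmap maps a file into memory instead of copying it; the OS pages
# the contents in on demand, which can reduce peak memory usage.
with open(file_list[0], "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    data = mm[:]  # bytes; decode if you need str
    mm.close()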

Netwave

I would never recommend storing this many text files in RAM; most of the time, this would take more memory than you have available. Instead, I would recommend restructuring your for loop so that you do not have to iterate over the files multiple times.

Since you are not saying that you need to change the files, I would recommend storing them all in a dictionary with the filename as the key. If you use an OrderedDict, you can even just iterate through the contents (using .itervalues() in Python 2, or .values() in Python 3) if the filenames are not important to you.

In this case, you could iterate over a list of file names using a for loop (build the list of filenames either directly with the appropriate os functions, e.g. os.listdir, or provide it beforehand) and read all files into the dictionary:

import collections

d = collections.OrderedDict()
file_list = ["a", "b", "c"]  # Fill in the real paths here or adapt the for loop accordingly
for file_path in file_list:
    # Open, read and immediately close each file, so no handles stay open
    with open(file_path, "r") as f:
        d[file_path] = f.read()
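The epoch loop can then work entirely from memory; a rough sketch (using `.items()`, the Python 3 spelling of `.iteritems()`):

for epoch in range(100):
    for file_path, content in d.items():
        # your current per-file logic here, now operating on the cached string
        pass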

Alternative way:

This is not an exactly matching solution, but an alternative which might speed you up a little: I do not know the files you are using, but if you can tell the input files apart (e.g. because each one contains only a single line), you could instead copy them all into one huge file and only walk through this file, e.g. with

for line in huge_cache_file:
    # your current logic here

This would not speed you up as much as using your RAM would, but it would get rid of the overhead of opening and closing 17K files a hundred times. At the end of the big cache file, you can then just jump back to the beginning using

huge_cache_file.seek(0)

If newlines are not an option but your files all have the same fixed length, you could still copy them together and iterate like this:

# Each read(file_length) call returns one file's worth of text; iter() stops at EOF (empty string)
for file_content in iter(lambda: huge_cache_file.read(file_length), ""):
    # your current logic here

If the files have different lengths, you can still do this, but store the length of each individual file in a list and use those stored lengths to read from the cache file:

file_lengths = [1024, 234, 16798704, ]  # all file lengths in sequence here
for epoch in range(0, 100):
    huge_cache_file.seek(0)
    for file_length in file_lengths:
        file_content = huge_cache_file.read(file_length)
        # your current logic here
ingofreyer
  • He is not asking for your opinion; the question is very specific, and we should stick to it. This answer should be a comment. Anyway, I agree with you on some points :) – Netwave Nov 29 '17 at 10:36
  • Well, the answer, including code, is right below. However, this piece of information is often crucial and not really an opinion. I am pretty sure that the user has already heard of `file.read()` and wanted to know more about how to store all of the files in RAM in an efficient way. :-) – ingofreyer Nov 29 '17 at 10:39
  • Your RAM limitation is the reason why I do not recommend caching the files in RAM at all. If you do not want the overhead of a dictionary, you could also store the file contents in a list. However, if you have little RAM, this still will not help you. Instead, you should rethink your second for loop: do not iterate over all files 100 times, but find a way to iterate only once. Depending on the rest of your implementation, you might also have additional variables storing a lot of content that could be freed up. – ingofreyer Nov 29 '17 at 10:59