
The identities list contains a large array of approximately 57,000 images. I am creating a negative list with the help of itertools.product(). This stores the whole list in memory, which is very costly, and my system hung after 4 minutes.

How can I optimize the code below and avoid keeping everything in memory?

for i in range(0, len(idendities) - 1):
    for j in range(i + 1, len(idendities)):
        cross_product = itertools.product(samples_list[i], samples_list[j])
        cross_product = list(cross_product)

        for cross_sample in cross_product:
            negative = []
            negative.append(cross_sample[0])
            negative.append(cross_sample[1])
            negatives.append(negative)
            print(len(negatives))

negatives = pd.DataFrame(negatives, columns=["file_x", "file_y"])
negatives["decision"] = "No"

negatives = negatives.sample(positives.shape[0])

Memory usage (9.30 GB) keeps growing higher and higher, and at one point the system hangs completely.

I also implemented the answer below and modified my code according to it.

for i in range(0, len(idendities) - 1):
    for j in range(i + 1, len(idendities)):
        for cross_sample in itertools.product(samples_list[i], samples_list[j]):
            negative = [cross_sample[0], cross_sample[1]]
            negatives.append(negative)
            print(len(negatives))

negatives = pd.DataFrame(negatives, columns=["file_x", "file_y"])
negatives["decision"] = "No"
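
Even with the generator, this second version still appends every pair to negatives, so memory grows with the total number of pairs. It can help to estimate that number up front before generating anything; a small sketch (the helper name and the toy sizes are illustrative, not from the original code):

```python
# Count negative pairs without generating them: for identities with
# n1, n2, ..., nk samples, the number of cross-identity pairs is
# sum over i<j of n_i * n_j, which equals (S^2 - sum n_i^2) / 2
# where S = sum of all n_i.
def count_negative_pairs(samples_list):
    sizes = [len(s) for s in samples_list]
    total = sum(sizes)
    return (total * total - sum(n * n for n in sizes)) // 2

# Hypothetical example: 3 identities with 2, 3 and 4 samples each.
print(count_negative_pairs([["a"] * 2, ["b"] * 3, ["c"] * 4]))  # 2*3 + 2*4 + 3*4 = 26
```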

Third version of code

The resulting CSV file is too big: even opening it triggers an alert that the program cannot load the whole file. The process runs for about ten minutes, and then the system hangs completely again.

for i in range(0, len(idendities) - 1):
    for j in range(i + 1, len(idendities)):
        for cross_sample in itertools.product(samples_list[i], samples_list[j]):
            with open('/home/khawar/deepface/tests/results.csv', 'a+') as csvfile:
                writer = csv.writer(csvfile)
                writer.writerow([cross_sample[0], cross_sample[1]])
            negative = [cross_sample[0], cross_sample[1]]
            negatives.append(negative)

negatives = pd.DataFrame(negatives, columns=["file_x", "file_y"])
negatives["decision"] = "No"

negatives = negatives.sample(positives.shape[0])
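
Since only positives.shape[0] random negatives are kept in the end, the full pair list never has to exist at all. One possible approach (not from the answers below) is reservoir sampling, which bounds memory to the sample size; a sketch with illustrative names and toy data:

```python
import itertools
import random

def sample_negative_pairs(samples_list, k, seed=0):
    """Reservoir-sample k cross-identity pairs without storing them all."""
    rng = random.Random(seed)
    reservoir = []
    seen = 0
    for i in range(len(samples_list) - 1):
        for j in range(i + 1, len(samples_list)):
            for pair in itertools.product(samples_list[i], samples_list[j]):
                seen += 1
                if len(reservoir) < k:
                    reservoir.append(pair)
                else:
                    # replace an existing entry with probability k/seen
                    r = rng.randrange(seen)
                    if r < k:
                        reservoir[r] = pair
    return reservoir

# Toy data: 3 identities; 2 + 4 + 2 = 8 total cross-identity pairs.
pairs = sample_negative_pairs([["a1", "a2"], ["b1"], ["c1", "c2"]], k=3)
print(len(pairs))  # 3
```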

Memory screenshot.


  • Yes; you do not need the whole list of samples at once. You can read it row by row; even if you want to run machine-learning methods on it, you should not load the whole training data at once; you need to slice your data. – DRPK Jan 24 '21 at 06:19
  • So I guess it is not about these lines now; maybe it is about the other parts of your code. – DRPK Jan 24 '21 at 06:20
  • But I am thinking about one possible thing. – DRPK Jan 24 '21 at 06:20
  • Python has garbage collection, but sometimes it does not clear its completed tasks, variables, etc., and you need to clean these kinds of things manually. Check this question and notify me: https://stackoverflow.com/questions/1316767/how-can-i-explicitly-free-memory-in-python – DRPK Jan 24 '21 at 06:23
  • Actually, for measuring algorithm performance I need to compare all negative pairs and positive pairs. – Khawar Islam Jan 24 '21 at 06:39
  • Sure, do it, but do not load the whole data at once. I guess you need some advice on your architecture. We have a triangle in software development: CPU, memory, disk. In most cases you cannot boost all three together; you need to sacrifice at least one of them. If you have big data, you need to think about how to slice it into smaller parts, do your calculation on them, then join your results together. For example, if you want to measure performance, first save each algorithm's accuracy, duration, etc.; these are very small numerical values; then save them in a single file. – DRPK Jan 24 '21 at 06:53
  • Is sample_list the same as identities or not? Could you please clarify a bit? – DhakkanCoder Jan 28 '21 at 17:21
  • Not the same. Actually, they are all images. – Khawar Islam Jan 29 '21 at 00:51

3 Answers


The product from itertools is a generator, so by itself it does not store the whole list in memory. But on the next line, cross_product = list(cross_product), you convert it to a list object, which does store all the data in memory.

The idea of a generator is that you don't do all the calculation at the same time, as happens with your call list(itertools.product(samples_list[i], samples_list[j])). What you want instead is to generate the results one by one:

Try something like this:

for i in range(len(idendities) - 1):
    for j in range(i + 1, len(idendities)):
        for cross_sample in itertools.product(samples_list[i], samples_list[j]):
            # do something ...
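
This laziness can be seen directly: itertools.product yields pairs one at a time, and only list() materializes them all at once. A minimal demonstration with toy lists:

```python
import itertools

a = ["x1", "x2"]
b = ["y1", "y2", "y3"]

pairs = itertools.product(a, b)   # no pairs exist yet; this is a lazy iterator
first = next(pairs)               # only now is one pair produced
print(first)                      # ('x1', 'y1')

rest = list(pairs)                # materializes the remaining 5 of the 6 pairs
print(len(rest))                  # 5
```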

So I think I found your problem: you are appending all samples to the negatives list, which is why your memory use keeps growing. You need to write each row in real time, one line at a time.

Your data is CSV, right? So you can do it like this:

import csv

# open the file once, instead of re-opening it for every row
with open('results.csv', 'a+', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for i in range(len(idendities) - 1):
        for j in range(i + 1, len(idendities)):
            for cross_sample in itertools.product(samples_list[i], samples_list[j]):
                writer.writerow([cross_sample[0], cross_sample[1]])

The idea is to write the rows in real time instead of accumulating them.

Check this link too: how to write the real time data into csv file in python

Some credits to @9mat, @cybot and these questions How to get Cartesian product in Python using a generator?, how to write the real time data into csv file in python
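
Once the pairs are streamed to disk, the resulting CSV can also be read back in chunks rather than all at once, which keeps memory bounded on the reading side as well. A sketch using a tiny in-memory stand-in for the large file:

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the large results.csv written above.
csv_text = "file_x,file_y\n" + "\n".join(
    f"img{i}.jpg,img{i + 1}.jpg" for i in range(10)
)

total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # each chunk is a small DataFrame; process it, then let it be freed
    total_rows += len(chunk)

print(total_rows)  # 10
```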

DRPK
  • Someone told me this but never gave me a complete answer: "the reason is that this way only ONE cross_sample from the cross_product iterator is instantiated in memory at a time. Saves massively on RAM." – Khawar Islam Jan 23 '21 at 11:07

You could create a class to represent the product of multiple lists that behaves like a list but doesn't store any of the combinations. This would only "combine" the items on demand.

class ProductList:
    def __init__(self, *data):
        self.data = data
        self.size = 1
        for d in self.data: self.size *= len(d)

    def __len__(self): return self.size

    def __getitem__(self, index):
        if isinstance(index, slice):
            return [*map(self.__getitem__, range(len(self))[index])]
        # decode the flat index as a mixed-radix number,
        # with the last list as the least-significant "digit"
        result = tuple()
        for d in reversed(self.data):
            index, i = divmod(index, len(d))
            result = (d[i],) + result
        return result

    def __iter__(self):
        for i in range(len(self)): yield self[i]

    def __contains__(self, value):
        return len(value) == len(self.data) \
               and all(v in d for v, d in zip(value, self.data))

    def index(self, value):
        # inverse of __getitem__: re-encode the components as a flat index
        index = 0
        for v, d in zip(value, self.data):
            index = index*len(d) + d.index(v)
        return index

Usage:

p = ProductList(range(1234),range(1234,5678),range(5678,9101))

print(*p[:10],sep="\n")

(0, 1234, 5678)
(0, 1234, 5679)
(0, 1234, 5680)
(0, 1234, 5681)
(0, 1234, 5682)
(0, 1234, 5683)
(0, 1234, 5684)
(0, 1234, 5685)
(0, 1234, 5686)
(0, 1234, 5687)


len(p) # 18771376008

p[27]  # (0, 1234, 5705)

for c in p[103350956:103350960]: print(c)

(6, 4763, 5995)
(6, 4763, 5996)
(6, 4763, 5997)
(6, 4763, 5998)


p.index((6, 4763, 5995)) # 103350956
p[103350956]             # (6, 4763, 5995)

(6, 4763, 5995) in p     # True
(5995, 4763, 6) in p     # False
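
The same index arithmetic also works for the two-list case in the question: one divmod recovers a pair from a flat index, so a random sample of pairs needs only a random sample of integers. A self-contained sketch (the names and toy data are illustrative):

```python
import random

def pair_at(index, xs, ys):
    """Return the index-th pair of the Cartesian product xs x ys."""
    i, j = divmod(index, len(ys))
    return xs[i], ys[j]

xs = ["a", "b", "c"]
ys = ["p", "q"]
total = len(xs) * len(ys)  # 6 pairs, never materialized

print(pair_at(3, xs, ys))  # ('b', 'q')

# Sample 4 distinct random pairs using only 4 integers of memory.
rng = random.Random(0)
sample = [pair_at(k, xs, ys) for k in rng.sample(range(total), 4)]
print(len(sample))  # 4
```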
Alain T.

Actually, the generated pairs are kept in your memory, and that is why your memory use keeps growing.

You have to change the code so that pairs are generated and immediately released from memory.

Previous Code:

for i in range(0, len(idendities) - 1):
    for j in range(i + 1, len(idendities)):
        cross_product = itertools.product(samples_list[i], samples_list[j])
        cross_product = list(cross_product)

        for cross_sample in cross_product:
            negative = []
            negative.append(cross_sample[0])
            negative.append(cross_sample[1])
            negatives.append(negative)
            print(len(negatives))

negatives = pd.DataFrame(negatives, columns=["file_x", "file_y"])
negatives["decision"] = "No"

Memory-efficient code: save the pairs to a CSV the first time, so there is no need to generate them again on later runs.

import itertools
from pathlib import Path

import pandas as pd
from tqdm import tqdm

samples_list = list(identities.values())

if Path("positives_negatives.csv").exists():
    df = pd.read_csv("positives_negatives.csv")
else:
    # collect rows in a plain list; building the DataFrame once at the end
    # avoids the quadratic cost of appending to a DataFrame row by row
    rows = []
    for combo in tqdm(itertools.combinations(identities.values(), 2), desc="Negatives"):
        for cross_sample in itertools.product(combo[0], combo[1]):
            rows.append({"file_x": cross_sample[0], "file_y": cross_sample[1]})
    negatives = pd.DataFrame(rows)
    negatives["decision"] = "No"
    negatives = negatives.sample(positives.shape[0])
    df = pd.concat([positives, negatives]).reset_index(drop=True)
    df.to_csv("positives_negatives.csv", index=False)
Khawar Islam