
I have a folder with thousands of .pkl files, each containing a list of tuples of objects. The largest file is 8 GB and the smallest is 1 kB (an empty list). I am iterating through the files and loading each .pkl individually, so the peak memory consumption should be roughly that of the largest .pkl file. However, it went up to 65 GB of RAM and I have no clue why...

In the code below, I provide the functions that create and read the pickle files, and the plotting function that I called on a specific folder. I did not provide all the functions used, but they should not be the reason behind this issue.

Reason: functions such as PB only work on the file name, a string, so they have a low memory footprint. PB returns a single value, so the distance list should not be too large.

import os
from os.path import join  # join() is used below when building file paths
import _pickle as pickle
from matplotlib import pyplot as plt

def write_solutions(solutions, folder_path, file_name):
    """
    Function writing a pickle file out of the solutions list.
    This function overwrites any existing file.
    """
    with open(join(folder_path, file_name), "wb") as output:
        pickle.dump(solutions, output, -1)

def read_solutions(folder_path, file_name):
    """
    Function reading a pickle file and returning the solutions list.
    """
    with open(join(folder_path, file_name), "rb") as input:
        solutions = pickle.load(input)
    return solutions

def plotting_solution_space_line_detailed(folder, output, N, max_pb, files = None):
    """
    Function taking .pkl in input and plotting.
    """
    if files is None:
        # Load all the .pkl files
        files = os.listdir(folder)
        files = [elt for elt in files if elt.endswith(".pkl")]

    data = dict()
    for i in range(2, N+1):
        data[i] = [list(), list(), list(), list(), list(), list()]

    for file in files:
        item = read_solutions(folder, file)
        nfo = file_name_reader(file)
        n = len(nfo[0])
        desired_pb = PB(file)

        if len(item) == 0:
            data[n][3].append(1)
            data[n][2].append(desired_pb)

        else:
            data[n][1].append(1.1)
            data[n][0].append(desired_pb)

            # Computation of the actual closest PB
            distance = [abs(PB(file_namer(elt[0])) - desired_pb) for elt in item]
            i = distance.index(min(distance))
            plot_sol = item[i][0]
            actual_pb = PB(file_namer(plot_sol))

            # Plot of the actual PB
            data[n][5].append(1.2)
            data[n][4].append(actual_pb)

    empty = list()
    for i in range(2, N+1):
        if len(data[i][0]) == 0 and len(data[i][2]) == 0:
            empty.append(i)

    for elt in empty:
        del data[elt]

    # Creates the figure
    f, ax = plt.subplots(len(data), sharex=True, sharey=True, figsize=(10,5))
    f.subplots_adjust(hspace=0)
    plt.setp([a.get_xticklabels() for a in f.axes[:-1]], visible=False)
    for a in f.axes:
        a.tick_params(
            axis='y',          # changes apply to the y-axis
            which='both',      # both major and minor ticks are affected
            left=False,        # booleans, not the strings 'False'
            right=False,
            labelleft=False)   # labels along the left edge are off

        # Shrink the axes
        box = a.get_position()
        a.set_position([box.x0, box.y0, box.width * 0.9, box.height])

        # Add a vertical line at the max budget
        a.axvline(x=max_pb, linestyle= '--',lw = 0.4, color = "black")

    if len(data) > 1:
        for i in range(len(data)):
            key = list(data.keys())[i]
            X = data[key][0]
            Y = data[key][1]
            X2 = data[key][2]
            Y2 = data[key][3]
            X3 = data[key][4]
            Y3 = data[key][5]
            ax[i].scatter(X, Y, s = 3)
            ax[i].scatter(X2, Y2, s = 3, color = "crimson")
            ax[i].scatter(X3, Y3, s = 3, color = "teal")
            ax[i].set_ylim(0.8, 1.4)
            ax[i].set_ylabel("{} Signals".format(key))
            ax[i].text(1.01, 0.6, "Nb with solution(s):\n{}".format(len(X)), fontsize=8, transform=ax[i].transAxes)
            ax[i].text(1.01, 0.2, "Nb without solution(s):\n{}".format(len([x for x in X2 if x <= max_pb])), fontsize=8, transform=ax[i].transAxes)

    else:
        key = list(data.keys())[0]
        X = data[key][0]
        Y = data[key][1]
        X2 = data[key][2]
        Y2 = data[key][3]
        X3 = data[key][4]
        Y3 = data[key][5]
        ax.scatter(X, Y, s = 3)
        ax.scatter(X2, Y2, s = 3, color = "crimson")
        ax.scatter(X3, Y3, s = 3, color = "teal")
        ax.set_ylim(0.8, 1.4)
        ax.set_ylabel("{} Signals".format(key))
        ax.text(1.01, 0.6, "Nb solutions:\n{}".format(len(X)), fontsize=12, transform=ax.transAxes)
        ax.text(1.01, 0.2, "Nb no solutions:\n{}".format(len([x for x in X2 if x <= max_pb])), fontsize=8, transform=ax.transAxes)

    f.text(0.5, 0.94, 'Solution space', ha='center')
    f.text(0.5, 0.04, 'PB', ha='center')
    f.text(0.04, 0.5, 'Number of signals', va='center', rotation='vertical')

    plt.savefig("{}.png".format(output), dpi = 500)
    plt.close()

Do you see any reason for such high memory consumption with .pkl files? Is the data not as compact once unpickled into RAM? Or is it another issue?

Mathieu
    Have you already seen this question? https://stackoverflow.com/questions/13871152/why-pickle-eat-memory – ChatterOne May 01 '18 at 10:17
  • @ChatterOne Ok well that was an easy one... I don't know how I missed that post. Thanks a lot for pointing it out x') – Mathieu May 01 '18 at 10:19
  • Also, as suggested in the comments of the other question, you can have a look at https://github.com/pgbovine/streaming-pickle , which allows you to read the elements from pickle one by one, instead of loading them all into memory at the same time. – ChatterOne May 01 '18 at 14:23
  • @ChatterOne Indeed I did not know about this possibility when I coded my program... It's going to induce huge changes, but might be a solution :/ – Mathieu May 02 '18 at 07:28
  • @ChatterOne Btw the link you gave me seems quite old. The code always raises a TypeError (even with the given examples): write() argument must be str, not bytes. – Mathieu May 02 '18 at 08:19
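
For reference, a minimal sketch of the streaming idea discussed in the comments, using only the standard pickle module rather than the streaming-pickle package (dumping one tuple per record is an assumption, not the format the code above produces):

import pickle

def write_solutions_streaming(solutions, path):
    # Dump each tuple as its own pickle record instead of one big list.
    with open(path, "wb") as output:
        for sol in solutions:
            pickle.dump(sol, output, -1)

def read_solutions_streaming(path):
    # Yield one tuple at a time, so only a single element is in memory at once.
    with open(path, "rb") as infile:
        while True:
            try:
                yield pickle.load(infile)
            except EOFError:
                return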

1 Answer


Your mistake is to assume size equivalence. An integer in a pickle takes little more than the bytes needed to encode its value, but an integer object in memory costs (exceptions for small cached numbers notwithstanding) about 28 bytes.
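
A rough illustration of that gap (exact numbers depend on the Python version and platform):

import pickle
import sys

n = 123456
print(sys.getsizeof(n))          # in-memory size of the int object, typically 28 bytes on 64-bit CPython
print(len(pickle.dumps(n, -1)))  # pickled size, only a handful of bytes

# Container overhead adds up too: a list of a million small tuples pays for
# the list itself, every tuple header, and every object referenced inside.
pairs = [(i, i + 1) for i in range(1_000_000)]
print(sys.getsizeof(pairs))      # size of the list object alone, excluding the tuples it references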

This is why one would normally use numpy to hold such data: it stores compact arrays of fixed-size numeric types.
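
A small sketch of that difference, assuming the values are plain floats:

import sys
import numpy as np

values = [float(i) for i in range(1_000_000)]
arr = np.arange(1_000_000, dtype=np.float64)

# The list holds a million separate float objects (~24 bytes each) plus a pointer
# per element; the numpy array stores the same values contiguously at 8 bytes apiece.
print(sys.getsizeof(values))  # the list object alone; the float objects add roughly 24 MB more
print(arr.nbytes)             # 8000000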

deets
  • My files store this data: `[(S1, S2, S3), (S4, S5, S6), ..., (S7, S8, S9)]` where `S` is a custom-made object. How would you store this data efficiently, knowing that each file contains one list, and that the list can either be empty or made of thousands (millions) of tuples of object `S`? I might go towards HDF5 instead of pickle. Fortunately I have a powerful PC right now that can open large .pkl files (8 GB => 65 GB in RAM), but that's going to cause me problems soon... – Mathieu May 01 '18 at 10:29
  • As I do not know the makeup of S, I can’t comment on representing it differently. – deets May 01 '18 at 10:40
  • And BTW, HDF5 will NOT affect this. It is probably a better choice than pickle for other reasons (e.g. version compatibility), but if your internal objects look the same, they need the same memory. – deets May 01 '18 at 10:46
  • For the S object, about 10 attributes are ints and floats, and 1 is a list of floats. I agree that HDF5 will not affect the size on disk, but it should affect the memory in RAM (which is the issue I have), c.f. the comment by ChatterOne. – Mathieu May 01 '18 at 11:18
  • If pickle adds even more overhead during load, fair enough. The resulting objects though will be the same. If the list is fixed sized or at least has an upper bound, you can represent all state as a vector of float. – deets May 01 '18 at 11:37
  • Sadly, I have no idea of the list size before I compute it. It can be empty or have 1 million tuples of the object in it. This is why I get .pkl files from 1 kB to 8 GB. – Mathieu May 01 '18 at 11:42
  • Well, you could store the known data in a large numpy array (so that an instance of S only keeps an index to this) and then the dynamic list as one numpy array per S. That should already reduce the memory footprint significantly. – deets May 01 '18 at 12:45
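
A minimal sketch of the layout deets describes, with hypothetical attribute counts and names since the real S class is not shown in the thread:

import numpy as np

N_SOLUTIONS = 1000  # hypothetical number of S instances

# Fixed-size numeric attributes packed into one shared table:
# one row per instance, one column per attribute.
attribute_table = np.zeros((N_SOLUTIONS, 10), dtype=np.float64)

class S:
    __slots__ = ("index", "values")

    def __init__(self, index, dynamic_values):
        self.index = index  # row of this instance in the shared attribute table
        # The variable-length part stored as one compact array per instance.
        self.values = np.asarray(dynamic_values, dtype=np.float64)

    @property
    def attrs(self):
        return attribute_table[self.index]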