For now, I am just running PCA and KNN on 400 RGB images of watches to find the most similar ones among them. I want to know how much memory my program is using at each step. For this reason, I followed this link, and my source code is the following:

import cv2
import numpy as np
import os
from glob import glob
from sklearn.decomposition import PCA
from sklearn import neighbors
from sklearn import preprocessing
import psutil

def memory_usage():
    process = psutil.Process(os.getpid())
    print(round(process.memory_info().rss / (10 ** 9), 3), 'GB')

data = []

# Read images from file
for filename in glob('Watches/*.jpg'):

    img = cv2.imread(filename)
    height, width = img.shape[:2]
    img = np.array(img)

    # Check that all my images are of the same resolution
    if height == 529 and width == 940:

        # Flatten each image into a single 1-D vector of height * width * 3 values
        img = np.concatenate(img, axis=0)
        img = np.concatenate(img, axis=0)
        data.append(img)

memory_usage()

# Normalise data
data = np.array(data)
Norm = preprocessing.Normalizer()
Norm.fit(data)
data = Norm.transform(data)

memory_usage()

# PCA model
pca = PCA(0.95)
pca.fit(data)
data = pca.transform(data)

memory_usage()

# K-Nearest neighbours
knn = neighbors.NearestNeighbors(n_neighbors=4, algorithm='ball_tree', metric='minkowski').fit(data)
distances, indices = knn.kneighbors(data)
print(indices)

memory_usage()

The output is the following:

0.334 GB  # after loading images
1.712 GB  # after data normalisation
1.5 GB    # after pca
1.503 GB  # after knn

What is the meaning of these outputs?

Do they represent the memory used at each point, and are they a direct indicator of the memory required by the objects and functions of the program up to that point (or are things more complicated)?

For example, why is memory usage higher after data normalisation than after PCA?
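
For reference, this is the kind of cross-check I have in mind: printing the raw NumPy buffer size (`ndarray.nbytes`) next to the RSS reading. It is only a minimal sketch with a small synthetic array standing in for my data; the `report` helper and the stage labels are illustrative and not part of my actual script.

import os

import numpy as np
import psutil


def report(stage, arr):
    # Print the process RSS next to the raw size of one NumPy array
    rss = psutil.Process(os.getpid()).memory_info().rss
    print(stage, '- RSS:', round(rss / 1e9, 3), 'GB, array:',
          round(arr.nbytes / 1e9, 3), 'GB, dtype:', arr.dtype)


# Small synthetic stand-in for the flattened images (uint8 pixels)
data = np.random.randint(0, 256, size=(400, 100000), dtype=np.uint8)
report('after loading', data)

# Converting to float64 (which, as far as I know, sklearn's preprocessing does
# internally) multiplies the buffer size by 8 before any further copies are made
data = data.astype(np.float64)
report('after float conversion', data)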

  • These numbers are not as meaningful as you might think. I don't know your ML library, but from what I can see you are measuring RSS, the resident set size. That is the amount of memory your process holds in physical RAM, which is a poor indication of what's actually going on. For example, Python maintains its own memory pools, so heap usage might be low while RSS stays high. Also, RSS doesn't account for swapped-out pages, so your process may be using more memory than it reports. Anyway, I think you should look for better alternatives for measuring memory usage. – Jason Hu Feb 22 '18 at 17:22
  • Here you go: https://stackoverflow.com/questions/552744/how-do-i-profile-memory-usage-in-python – Jason Hu Feb 22 '18 at 17:26
  • Thanks for your comment. Yes, perhaps I just have to find a better way to measure memory usage than `psutil`, but I wrote this question precisely to clarify some of this. – Outcast Feb 22 '18 at 17:39
  • Then yeah, as I said, I don't think it's a meaningful measurement in this context. – Jason Hu Feb 22 '18 at 17:41
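
Following up on the link in the comments, here is a minimal sketch of the kind of alternative I understand is being suggested: using the standard-library `tracemalloc` to track Python-level allocations instead of RSS. The synthetic array just stands in for my data (recent NumPy versions report their buffers to tracemalloc, as far as I know).

import tracemalloc

import numpy as np

tracemalloc.start()

# Allocation-heavy step; a synthetic array stands in for the real image data
data = np.zeros((400, 100000), dtype=np.float64)

# Current and peak traced allocations since tracemalloc.start()
current, peak = tracemalloc.get_traced_memory()
print('current:', round(current / 1e9, 3), 'GB, peak:', round(peak / 1e9, 3), 'GB')

# Top allocation sites, grouped by source line
for stat in tracemalloc.take_snapshot().statistics('lineno')[:5]:
    print(stat)

tracemalloc.stop()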
