-2

I have to implement a clustering algorithm, after loading the dataset, I go to check for each point in which cluster it can be inserted. If points cannot be inserted into any cluster, I have to move them from the dataset and insert them into the retained set. Since I do not know a priori the size of the retained set, I allocate an area of memory initially equal to 0 and that is incremented by the bytes size needed to hold a point each time I have to insert a point into the retained set. It works for some iterations (4 to be precise) and then stops

This is what I try:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <malloc/malloc.h>
#include <float.h>
#include <stdbool.h>

double **load_dataset(char *filename, int d, int chunk_size);

int assign_point_to_cluster(double **clusters, int **set, double **retained_set, double *point,double *standard_deviation,
                            int d, int k, int *chunk_size, int p_in_r);

int find_candidate_cluster(double **clusters, double *point, double *std_deviation, int d, int k);

double mean(const double *vector, int d);

double mahalanobis(const double *cluster, const double *point, const double *standard_deviation, int d);

void compute_std_dev(const double *cluster, double *standard_deviation_vector, int d);

int inizialize_cluster(double **dataset, int d, int k, double **clusters, int **set, int chunk_size, bool retain);

double compute_sum_of_euclidean_distance(double **center_points, double *point, int n, int d);

void feature_scaling(double **dataset, int d, int chunk_size);

int main(int argc, char **argv) {
    if(argc < 6){
        printf("Error parameters! Usage: ./main <input_file> <total number of point> <chunk_size> <points_dimension> <cluster_number>");
        return 0;
    }

    char* filename = argv[1];
    int d = atoi(argv[4]), k = atoi(argv[5]), chunk_size = atoi(argv[3]), total = atoi(argv[2]);
    int k_compressed = 0;

    printf("Path: %s\n", filename);
    printf("Number of point in set %i\n", total);
    printf("chunk size: %i\n", chunk_size);
    printf("Dimension of points: %i\n", d);
    printf("Number of cluster: %i\n", k);
    printf("----------------\n");


    double **clusters = malloc(k * sizeof(double *));
    double *standard_deviation = malloc(d * sizeof(double));
    int **discard_set = malloc(k * sizeof(int *));
    double **retained_set = malloc(1);
    double * cohesion = malloc(2 * sizeof(double));
    double* radius = NULL;
    double **mini_cluster = NULL;
    double* temp_cluster = NULL;
    int** compressed_set = NULL;
    double** mini_cluster_temp = NULL;
    int p_in_r = 0;



    double **dataset = load_dataset(filename, d, chunk_size);
    /**
     * Rescaling of variables
     */

    //feature_scaling(dataset, d, chunk_size); TODO: Something is wrong

    /**
     * Cluster initialization
     */


    if(!clusters || !discard_set || !standard_deviation || !retained_set || !cohesion){
        printf("Something went wrong in main(), memory allocation failed!");
        exit(1);
    }

    chunk_size = inizialize_cluster(dataset, d, k, clusters, discard_set, chunk_size, false);

    /**
     * At this point we are only interested in keeping a "summary" of the data that will be placed within a cluster.
     * In dataset we put the id of the points that are added to a cluster, while cluster contains the statistics
     * useful to perform clustering
     **/


    /**
     * We start processing the points the (CHUNK - 1)eighth point in the dataset is assigned to the cluster if the
     * mahalanobis distance is less than a threshold and if it is the closest.
     * Clusetering dataset -> discard_set
     */

    while (chunk_size > 0) {

        p_in_r += assign_point_to_cluster(clusters, discard_set, retained_set, dataset[chunk_size - 1], standard_deviation, d, k, &chunk_size, p_in_r);
        /**
         * always working on the last element of the dataset, it is not necessary to move the list of points,
         * just delete the last element
         */

        free(dataset[chunk_size]);
        dataset[chunk_size] = NULL;
        dataset = realloc(dataset, chunk_size * sizeof(double *));
        if(dataset == NULL){
            printf("Something went wrong in main(), memory allocation failed!");
            exit(1);
        }

    }

    free(dataset);
    dataset = NULL;


    return 0;


}

int inizialize_cluster(double **dataset, int d, int k, double ** clusters, int** set, int chunk_size, bool retain) {
    double ** center_point = malloc(k * sizeof(double *));

    for (int i = 0; i < k; i++) {
        center_point[i] = malloc((d + 1) * sizeof(double));

        if(center_point[i] == NULL){
            printf("Something went wrong in inizialize_cluster(), memory allocation failed!");
            exit(1);
        }

    }

    /**
     * The point representing the center of the first cluster is chosen as the first point in the dataset
     **/

    memcpy(*center_point, *dataset,  (d + 1) * sizeof(double));


    /**
     *  The first point can be removed from the dataset or
     *  in case we are working on the retained set, move it to the end.
    **/

    chunk_size--;
    if(retain){
        double* temp = malloc(sizeof(double *));
        memcpy(temp, dataset, sizeof(double *));
        memcpy(dataset, dataset+1, chunk_size * sizeof(double *));
        memcpy(dataset+chunk_size-1, temp, sizeof(double *));

        /*for (int i = 0; i < CHUNK; ++i) {
            printf("id[%i]: %f", dataset[i][0]);
        }*/
    }
    else{
        free(dataset[0]);
        memcpy(dataset, dataset+1, chunk_size * sizeof(double *));
        dataset[chunk_size] = NULL;
        dataset = realloc(dataset, chunk_size * sizeof(double *));
        if(dataset == NULL){
            printf("Something went wrong in inizialize_cluster(), memory allocation failed!");
            exit(1);
        }
    }

    /**
     * The centers of the next clusters are chosen as those that are furthest apart
     **/

    double max;
    int pos;
    double distance;

    for (int i = 1; i < k; i++) {

        /**
         * I choose the point that maximizes the sum of the distances from the centerpieces
         */

        max = -1;
        for (int j = 0; j < chunk_size; j++){
            distance = compute_sum_of_euclidean_distance(center_point, dataset[j], i, d);
            if (distance > max) {
                pos = j;
                max = distance;
            }
        }

        memcpy(*(center_point + i), *(dataset + pos), (d + 1) * sizeof(double));

        /**
         * When a point is chosen as the center of a cluster, I remove it from the dataset
        **/

        chunk_size--;
        if(retain){
            double** temp = malloc(sizeof(double *));
            memcpy(temp, dataset + pos, sizeof(double *));
            memcpy(dataset + pos, dataset + pos + 1, (chunk_size - pos) * sizeof(double *));
            memcpy(dataset + chunk_size - 1, temp, sizeof(double *));
        }
        else{
            free(dataset[pos]);
            memcpy(dataset + pos, dataset + pos + 1, (chunk_size - pos) * sizeof(double *));
            dataset = realloc(dataset, chunk_size * sizeof(double *));
            if(dataset == NULL){
                printf("Something went wrong in inizialize_cluster(), memory allocation failed!");
                exit(1);
            }
        }
    }

    /**
     * When I have found k points that can be used as the initial centres of the k clusters,
     * I summarize them (calculate cluster statistics) and enter them into the discard set.
     */

    for (int i = 0; i < k; i++) {
        /**
         * Cluster and discard set initialization
         */
        clusters[i] = malloc(((2 * d) + 1) * sizeof(double));
        set[i] = malloc(sizeof(int ));
        if(clusters[i] == NULL || set[i] == NULL){
            printf("Something went wrong in in inizialize_cluster(), memory allocation failed!");
            exit(1);
        }

        clusters[i][0]=1;
        set[i][0] = (int) center_point[i][0];
        for (int j = 1; j < d + 1; j++) {
            clusters[i][j] = center_point[i][j];
            clusters[i][j + d] = pow(center_point[i][j], 2);
        }
        free(center_point[i]);
        center_point[i] = NULL;
    }

    free(center_point);
    center_point = NULL;

    return chunk_size;
}

double **load_dataset(char *filename, int d, int chunk_size) {

    double **dataset = malloc(chunk_size * sizeof(double *));

    if(dataset == NULL){
        printf("Something went wrong in load_dataset(), memory allocation failed!");
        exit(1);
    }

    for (int i = 0; i < chunk_size; i++) {
        dataset[i] = malloc((d + 1) * sizeof(double));
        if(dataset[i] == NULL){
            printf("Something went wrong in load_dataset(), memory allocation failed!");
            exit(1);
        }
    }

    FILE *file;
    file=fopen(filename, "r");
    if (file == NULL){
        printf("Something went wrong in load_dataset(), file opening failed! (row 162)");
        exit(1);
    }

    char *line = NULL, *token;
    size_t len = 0;
    int i = 0;
    int j = 0;
    int first_line = 0;

    while ((getline(&line, &len, file)) != -1 && i < chunk_size) {
        if(first_line != 0) {
            while ((token = strsep(&line, ",")) != NULL) {
                dataset[i][j] = atof(token);
                j++;
            }
            j = 0;
            i++;
        } else{
            first_line = 1;
        }
    }

    fclose(file);

    return dataset;
}



int assign_point_to_cluster(double **clusters, int **set, double **retained_set, double *point,double *standard_deviation,
                            int d, int k, int *chunk_size, int p_in_r) {

    /**
     * For each point I assess which cluster it can go into
     */

    int candidate;
    candidate = find_candidate_cluster(clusters, point, standard_deviation, d, k);

    /**
     * After identifying the candidate cluster (if there is one), I add the point to the discard set and update the
     * cluster statistics otherwise I go ahead and put the point in the retained set
     */

    (*chunk_size)--;
    if(candidate > -1){

        /**
         * I add the point to the discard/compressed set
         */

        clusters[candidate][0]++;

        set[candidate] = realloc(set[candidate], (unsigned long)clusters[candidate][0] * sizeof(int));

        if(set[candidate] == NULL){
            printf("Something went wrong in in assign_point_to_cluster(), memory allocation failed!");
            exit(1);
        }

        set[candidate][(int) clusters[candidate][0] - 1] = (int) point[0];

        /**
         * I update the cluster statistics
         */

        for (int i = 1; i < d + 1; i++) {
            clusters[candidate][i] += point[i];
            clusters[candidate][i + d] += pow(point[i], 2);
        }
    }
    else if(retained_set){

        /**
         * I insert the point in the retained set
         */

        p_in_r++;
        retained_set = realloc(retained_set, p_in_r * sizeof(double *));
        retained_set[p_in_r - 1] = malloc((d + 1) * sizeof(double));
        memcpy(*(retained_set + p_in_r - 1), point, (d + 1) * sizeof(double ));

        return 1;
    }

    return 0;
}

int find_candidate_cluster(double **clusters, double *point, double *std_deviation, int d, int k) {
    double actual = DBL_MAX;
    int candidate = -1;
    double threshold;
    double distance;

    for (int j = 0; j < k; j++) {
        /**
           * Calculation of varainza,threshold and mahalanobis' distance
         */

        compute_std_dev(clusters[j], std_deviation, d);

        //TODO: Would it be okay as a threshold? An alternative could be the module?
        threshold = 3.5 * mean(std_deviation, d);

        distance = mahalanobis(clusters[j], point, std_deviation, d);

        if(distance < threshold && distance < actual){
            /**
             * the cluster is a candidate for the point
             */
            candidate = j;
            actual = distance;
        }
    }
    return candidate;
}

double mean(const double *vector, int d) {
    double sum = 0;
    for (int i = 0; i < d; ++i) {
        sum += vector[i];
    }
    return sum/d;
}

void compute_std_dev(const double *cluster, double *standard_deviation_vector, int d) {
    double sigma;
    /**
     * Vector of the variances of the components of the cluster elements
     */

    
    for (int i = 0; i < d; i++) {
        sigma = sqrt(fabs(cluster[i + 1 + d]/cluster[0] - pow(cluster[i + 1]/cluster[0], 2)));
        if( sigma == 0)
            sigma = 1;
        standard_deviation_vector[i] = sigma;
    }
}

double mahalanobis(const double *cluster, const double *point, const double *standard_deviation, int d) {
    double distance=0;
    for (int i = 1; i < d; ++i) {
        distance += pow((point[i] - cluster[i]) / standard_deviation[i - 1], 2);
    }
    return sqrt(distance)/d; //TODO: can it be okay? I thought so since the threshold is the average of the st.dev.
}


double compute_sum_of_euclidean_distance(double **center_points, double *point, int n, int d) {
    double component_sum = 0;
    double final_sum = 0;

    for (int i = 0; i < n; i++) {
        for (int j = 1; j < d + 1; j++){
            component_sum += pow(center_points[i][j] - point[j], 2);
        }
        final_sum += sqrt(component_sum);
    }
    return final_sum;
}

void feature_scaling(double **dataset, int d, int chunk_size) {
    /**
     * We perform a Z-score Normalization
     **/

    double mean;
    double sigma;
    double sum;
    double sumQ;
    double variance;

    /**
     * We calculate mean and variance for each column
     **/

    for (int i = 1; i < d + 1; i++) {

        sum = 0;

        for (int j = 0; j < chunk_size; j++) {
            sum += dataset[j][i];
        }
        mean = sum / chunk_size;

        sumQ = 0;
        for (int j = 0; j < chunk_size; j++) {
            sumQ += pow((dataset[j][i] - mean), 2);
        }

        variance = sumQ / chunk_size;

        sigma = sqrt(variance);
        if( sigma == 0)
            sigma = 1;

        /**
         * Feature scaling: (x-x_med)/sigma
         **/

        for (int j = 0; j < chunk_size; j++) {
            dataset[j][i] = (dataset[j][i] - mean) / sigma;
        }
    }

}

The command I use when run is:

./main "db.csv" 100 35 4 3

It works if the 3rd argument is less then 34

The file db.csv contains:

CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100),cluster
1,0,19,15,39,4
2,0,21,47,81,3
3,1,20,56,6,4
4,1,23,16,77,3
5,1,31,17,40,4
6,1,22,17,76,3
7,1,35,18,6,4
8,1,23,18,94,3
9,0,64,19,3,4
10,1,30,19,72,3
11,0,67,19,14,4
12,1,35,19,99,3
13,1,58,20,15,4
14,1,24,20,77,3
15,0,37,20,13,4
16,0,22,20,79,3
17,1,35,21,35,4
18,0,20,21,66,3
19,0,52,23,29,4
20,1,35,23,98,3
21,0,35,24,35,4
22,0,25,24,73,3
23,1,46,25,5,4
24,0,31,25,73,3
25,1,54,28,14,4
26,0,29,28,82,3
27,1,45,28,32,4
28,0,35,28,61,3
29,1,40,29,31,4
30,1,23,29,87,3
31,0,60,30,4,4
32,1,21,30,73,3
33,0,53,33,4,4
34,0,18,33,92,3
35,1,49,33,14,4
36,1,21,33,81,3
37,1,42,34,17,4
38,1,30,34,73,3
39,1,36,37,26,4
40,1,20,37,75,3
41,1,65,38,35,0
42,0,24,38,92,3
43,0,48,39,36,0
44,1,31,39,61,5
45,1,49,39,28,4
46,1,24,39,65,3
47,1,50,40,55,0
48,1,27,40,47,5
49,1,29,40,42,5
50,1,31,40,42,5
51,1,49,42,52,0
52,0,33,42,60,5
53,1,31,43,54,5
54,0,59,43,60,0
55,1,50,43,45,0
56,0,47,43,41,0
57,1,51,44,50,0
58,0,69,44,46,0
59,1,27,46,51,5
60,0,53,46,46,0
61,0,70,46,56,0
62,0,19,46,55,5
63,1,67,47,52,0
64,1,54,47,59,0
65,0,63,48,51,0
66,0,18,48,59,5
67,1,43,48,50,0
68,1,68,48,48,0
69,0,19,48,59,5
70,1,32,48,47,5
71,0,70,49,55,0
72,1,47,49,42,0
73,1,60,50,49,0
74,1,60,50,56,0
75,0,59,54,47,0
76,0,26,54,54,5
77,1,45,54,53,0
78,0,40,54,48,5
79,1,23,54,52,5
80,1,49,54,42,0
81,0,57,54,51,0
82,0,38,54,55,5
83,0,67,54,41,0
84,1,46,54,44,0
85,1,21,54,57,5
86,0,48,54,46,0
87,1,55,57,58,0
88,1,22,57,55,5
89,1,34,58,60,5
90,1,50,58,46,0
91,1,68,59,55,0
92,0,18,59,41,5
93,0,48,60,49,0
94,1,40,60,40,5
95,1,32,60,42,5
96,0,24,60,52,5
97,1,47,60,47,0
98,1,27,60,50,5
99,0,48,61,42,0
100,0,20,61,49,5
101,1,23,62,41,5
102,1,49,62,48,0
103,0,67,62,59,0
104,0,26,62,55,5
105,0,49,62,56,0
106,1,21,62,42,5
107,1,66,63,50,0
108,0,54,63,46,0
109,0,68,63,43,0
110,0,66,63,48,0
111,0,65,63,52,0
112,1,19,63,54,5
113,1,38,64,42,5
114,0,19,64,46,5
115,1,18,65,48,5
116,1,19,65,50,5
117,1,63,65,43,0
118,1,49,65,59,0
119,1,51,67,43,0
120,1,50,67,57,0
121,0,27,67,56,5
122,1,38,67,40,5
123,1,40,69,58,5
124,0,39,69,91,1
125,1,23,70,29,5
126,1,31,70,77,1
127,0,43,71,35,2
128,0,40,71,95,1
129,0,59,71,11,2
130,0,38,71,75,1
131,0,47,71,9,2
132,0,39,71,75,1
133,1,25,72,34,5
134,1,31,72,71,1
135,0,20,73,5,2
136,1,29,73,88,1
137,1,44,73,7,2
138,0,32,73,73,1
139,0,19,74,10,2
140,1,35,74,72,1
141,1,57,75,5,2
142,0,32,75,93,1
143,1,28,76,40,5
144,1,32,76,87,1
145,0,25,77,12,2
146,0,28,77,97,1
147,0,48,77,36,2
148,1,32,77,74,1
149,1,34,78,22,2
150,0,34,78,90,1
151,0,43,78,17,2
152,0,39,78,88,1
153,1,44,78,20,2
154,1,38,78,76,1
155,1,47,78,16,2
156,1,27,78,89,1
157,0,37,78,1,2
158,1,30,78,78,1
159,0,34,78,1,2
160,1,30,78,73,1
161,1,56,79,35,2
162,1,29,79,83,1
163,0,19,81,5,2
164,1,31,81,93,1
165,0,50,85,26,2
166,1,36,85,75,1
167,0,42,86,20,2
168,1,33,86,95,1
169,1,36,87,27,2
170,0,32,87,63,1
171,0,40,87,13,2
172,0,28,87,75,1
173,0,36,87,10,2
174,0,36,87,92,1
175,1,52,88,13,2
176,1,30,88,86,1
177,0,58,88,15,2
178,0,27,88,69,1
179,0,59,93,14,2
180,0,35,93,90,1
181,1,37,97,32,2
182,1,32,97,86,1
183,0,46,98,15,2
184,1,29,98,88,1
185,1,41,99,39,2
186,0,30,99,97,1
187,1,54,101,24,2
188,0,28,101,68,1
189,1,41,103,17,2
190,1,36,103,85,1
191,1,34,103,23,2
192,1,32,103,69,1
193,0,33,113,8,2
194,1,38,113,91,1
195,1,47,120,16,2
196,1,35,120,79,1
197,1,45,126,28,2
198,0,32,126,74,1
199,0,32,137,18,2
200,0,30,137,83,1

download it from mega: db.csv.

Originally found on Kaggle but I made some modifications.

Edit: I included the whole code

Edit: I alse get this error trying to see what is in retained_set : read memory from 0x3d2fdfcb8030 failed (0 of 8 bytes read)

Edit: I translate the comment in the code and added the file I use as input

  • 1
    [don't cast malloc](https://stackoverflow.com/questions/605845/do-i-cast-the-result-of-malloc) – Barmar Nov 30 '22 at 15:09
  • I insert the main because i use a function, in this way you can understand where the main ends and the function start, I'm going to edit to make all declaration understandable – Luca Marchio Nov 30 '22 at 15:15
  • @Barmar should it be the problem, or it is just a good practice? – Luca Marchio Nov 30 '22 at 15:22
  • 1
    just good practice, it's rarely the cause of a problem. Read the linked question to see what problems it can cause. – Barmar Nov 30 '22 at 15:22
  • The result of `malloc(0)` (size zero) is implementation defined. What does your implementation do? If you are not sure it's probably best to use a size that all implementations agree on how to treat. – Ted Lyngmo Nov 30 '22 at 15:37
  • @TedLyngmo debugging i see it return an address, but also using `malloc(1)` I get the same error – Luca Marchio Nov 30 '22 at 15:47
  • @LucaMarchio The code in the question contains syntax errors. Please [edit] your question and copy&paste exactly the code you compile and run on your system. Also show the input / command line arguments you use to reproduce the problem, the output/result/behavior or error message you get and the expected output/result/behavior. --> [mre] – Bodo Nov 30 '22 at 15:58
  • Your program takes input from files. Please edit your question and post the input data in a separate code block here. It should be downloadable. Also, please post the command arguments. We should be able to download all that, compile your program, and run it under a debugger (e.g. `gdb`). Also, it may be helpful if the key comments are in English as it seems they describe your intent, but are in Italian. I've had some luck with google translate. – Craig Estey Nov 30 '22 at 17:42
  • Should I pass the address of `retained_set` and `dataset` to the function if into which I use realloc or malloc? Doing something like `* retained_set = realloc(*retained_set, new_size)`? – Luca Marchio Dec 01 '22 at 08:44

1 Answers1

0

assign_point_to_cluster has a local variable double **retained_set. This means that you cannot do retained_set = realloc(retained_set, ... or you will just change where that local variable points at, not where the pointer-to-pointer on the caller side points at. And because of that you also create a memory leak. See this FAQ: Dynamic memory access only works inside function

As for how to solve it, it appears that encapsulating all of this data into structs would simplify the program a lot. You could also implement it as an "opaque type" (How to do private encapsulation in C?) and get rid of the caller's responsibility to handle dynamic allocation.

Using 2D arrays instead of pointer-to-pointers might also simplify the program and improve performance. For example if you could use a "pointer to array pointer" parameter double (**retained_set)[x][y]) then you could do double (*tmp)[x][y] = realloc(*retained_set,...) and then *retained_set = tmp;, which would affect the caller. But structs would be easier to read so that should be the first option.

Also note that malloc.h has been obsolete since forever. difference between <stdlib.h> and <malloc.h>

Lundin
  • 195,001
  • 40
  • 254
  • 396