
I am required to write a program that implements k-means clustering for a given dataset (I roughly understand how the k-means algorithm works). Since I want my program to be generic, I'd like to understand the following terms:

For a given data set that has 100 rows and 10 columns (assuming each column is a feature), how do I identify the following parameters:

  1. dimension: How do I know the dimension of this dataset?
  2. data point: Does it mean that every cell [row][col] is a data point or the whole row is one data point (vector of points)?
gsamaras
Frank
  • Every dimension corresponds to a feature; a data point is a row, i.e. a point in that NC-dimensional space. –  Sep 04 '16 at 15:00

2 Answers


You have to see your dataset from a computational-geometry point of view, where every element of your dataset is a point in a D-dimensional space.

Your dataset looks like this, I guess:

row0.col0 row0.col1 ... row0.col9
...
row99.col0 row99.col1 ... row99.col9

From that view, I would interpret this dataset as 100 points in 10 dimensions.


Dimension

It's the number of columns, so 10. Every column is a coordinate from the mathematical view! ;)

Data point

Every row is a data point! Every cell is a coordinate of this point!
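
A minimal sketch of this reading, assuming the dataset is already loaded in memory as a list of equal-length rows. The function name `describe_dataset` and the toy values are made up for illustration; they are not part of the question.

```python
def describe_dataset(rows):
    """Return (number of points, dimension) for a row-per-point dataset."""
    if not rows:
        raise ValueError("dataset is empty")
    dimension = len(rows[0])
    # Every row must have the same number of coordinates.
    if any(len(row) != dimension for row in rows):
        raise ValueError("rows have inconsistent lengths")
    return len(rows), dimension


# Toy stand-in for the 100 x 10 dataset; the values themselves do not matter.
dataset = [[float(r * 10 + c) for c in range(10)] for r in range(100)]

n_points, dimension = describe_dataset(dataset)
print(n_points, dimension)   # 100 10

point = dataset[3]           # one data point: the whole row
coordinate = dataset[3][7]   # one coordinate of that point: a single cell
```

Checking that all rows have the same length is a design choice of this sketch: a generic k-means program cannot compute distances between points of different dimension.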


For example, check my minimal example here: you will see that I create 10000000 points (the rows in your case) in 64 dimensions (the columns in your case).

gsamaras
  • Oh GREAT. Thank you very much. So if I want to pick initial random centroids for 5 clusters, I'd randomly choose 5 rows, one centroid per cluster. Right? – Frank Sep 04 '16 at 02:48
  • @Frank every centroid is a point in the same space as the points of your dataset, with the same number of dimensions. So yes, you can pick 5 points at random from your dataset and use them as the 5 initial centroids (that is, pick 5 rows from your dataset). You can easily do this with a random number generator over [0, 99], since you have 100 points. For example, I would [select randomly like this in C](https://gsamaras.wordpress.com/code/random-numbers-%E2%88%88min-max/). – gsamaras Sep 04 '16 at 02:53
  • Thank you very much for your comments. Really helpful. I have another question. In fact, I still do not (really) understand why we call the entire row a point. When talking about points, I think of an (x, y) point in the plane. – Frank Sep 04 '16 at 02:56
  • @Frank exactly! You think of `(x, y)` as a *point*! That's in 2D (the plane). ;) So in 2D, for a tiny dataset of 7 points, you would have 7 rows and 2 columns. Makes sense? :) – gsamaras Sep 04 '16 at 02:58
  • It DOES make sense! I understand now. I really appreciate your help – Frank Sep 04 '16 at 03:07
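
The centroid initialization discussed in the comments above can be sketched as follows. This is only an illustration under some assumptions: the name `pick_initial_centroids` and the `seed` parameter are hypothetical, and drawing the row indices without replacement (so that no row is picked twice) is a choice made here, not something the comments require.

```python
import random


def pick_initial_centroids(points, k, seed=None):
    """Pick k distinct rows of `points` to serve as initial centroids."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    # sample() draws without replacement, so no row index is chosen twice.
    indices = rng.sample(range(len(points)), k)
    # Copy each row so later centroid updates do not modify the dataset.
    return [list(points[i]) for i in indices]


# Toy 100 x 10 dataset (values are arbitrary).
points = [[float(r * 10 + c) for c in range(10)] for r in range(100)]

centroids = pick_initial_centroids(points, k=5, seed=42)
print(len(centroids), len(centroids[0]))  # 5 10
```

Each centroid has 10 coordinates, the same as every data point. If the dataset contains duplicate rows, two centroids can still coincide even though their row indices differ.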

It depends.

But most languages and file formats (e.g. CSV) use one row per record and one column per dimension. This spreadsheet view is very common.

E.g. in Java, most people would read a double[100][10] matrix as 100 records, 10 dimensions each.

Some languages are different. Matlab and Julia store arrays in column-major order IIRC, so code there may use one column per record; a shape of (100,10) would then be 100 dimensions, 10 records.
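
If a dataset arrives in the other orientation, transposing it converts between the two conventions. A small sketch, assuming a rectangular list-of-lists matrix (the helper name `transpose` is made up for illustration):

```python
def transpose(matrix):
    """Swap rows and columns of a rectangular list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


# 3 records with 2 dimensions each, one row per record.
row_per_record = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Same data with one column per record: 2 rows, 3 columns.
column_per_record = transpose(row_per_record)
print(column_per_record)  # [[1.0, 3.0, 5.0], [2.0, 4.0, 6.0]]
```

Whichever layout is used, it helps to state it explicitly at the program's input boundary, since a 100 x 10 matrix is valid under both readings.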

Has QUIT--Anony-Mousse