7

I am looking for a tool that generates datasets of different shapes (square, circle, rectangle, etc.) with outliers, for cluster analysis.

Can anyone recommend a good dataset generator for cluster analysis? Is there any way to generate such datasets in a language like R?

Jeromy Anglim
Pradeep

3 Answers

7

You should probably look into the mlbench package, especially its synthetic dataset generators (the mlbench.* functions); see some examples below.

[Image: example datasets generated with mlbench.* functions]

Other datasets or utility functions are probably best found on the Cluster Task View on CRAN. As @Roman said, adding outliers is not really difficult, especially when you work in only two dimensions.
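As a rough illustration of the two points above, here is a minimal sketch that draws one of the mlbench shapes and scatters a few uniform outliers around it. It assumes mlbench is installed from CRAN; the outlier step, the 20% margin, and the names `n_out`, `box`, and `dat` are my own choices for illustration, not something the package provides.

```r
library(mlbench)  # assumed to be installed from CRAN

set.seed(42)

# Two interleaved spirals with a little Gaussian noise.
spirals <- mlbench.spirals(n = 300, cycles = 1.5, sd = 0.05)

# Illustrative outlier step (not part of mlbench): uniform points over the
# bounding box of the data, enlarged by 20% on each side.
n_out <- 15
box <- apply(spirals$x, 2, function(v) range(v) + c(-0.2, 0.2) * diff(range(v)))
outliers <- cbind(runif(n_out, box[1, 1], box[2, 1]),
                  runif(n_out, box[1, 2], box[2, 2]))

# Label 0 marks outliers; spiral points keep their class (1 or 2).
dat <- data.frame(rbind(spirals$x, outliers))
names(dat) <- c("x", "y")
dat$label <- c(as.integer(spirals$classes), rep(0L, n_out))

plot(dat$x, dat$y, col = dat$label + 1, pch = 19, asp = 1)
```

Other generators such as mlbench.smiley, mlbench.cassini, or mlbench.2dnormals should slot in the same way, since they also return a list with `x` and `classes` components.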

chl
6

I would create a shape and extract its bounding coordinates. You can then populate the shape with random points using the splancs package.

Here's a small snippet from one of my programs:

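library(splancs)  # provides csr()

# First we create a circle, into which uniform random points will be generated (kudos to Barry Rowlingson, r-sig-geo).
# x, y: centre; r: radius; n: number of vertices of the approximating polygon.
circle <- function(x, y, r, n) {
    t <- seq(from = 0, to = 2 * pi, length.out = n + 1)[-1]
    poly <- cbind(x = x + r * sin(t), y = y + r * cos(t))
    rbind(poly, poly[1, ])  # repeat the first vertex to close the polygon
}

# 1000 uniform random points inside a 30-sided polygon of radius 100
csr(circle(0, 0, 100, 30), 1000)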

[Image: plot of the generated points]

Feel free to add outliers. One way of going about this is sampling different shapes and joining them in different ways.
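To make the "sample different shapes and join them" idea concrete, here is a minimal sketch in base R only (so it swaps csr() for direct sampling rather than reproducing the splancs approach). The helper names `runif_disc` and `runif_rect`, the shape positions, and the outlier count are all made up for illustration.

```r
set.seed(1)

# Uniform points in a disc; taking sqrt() of the uniform draw keeps the
# density even instead of piling points up near the centre.
runif_disc <- function(n, cx, cy, r) {
  rad <- r * sqrt(runif(n))
  ang <- runif(n, 0, 2 * pi)
  cbind(x = cx + rad * cos(ang), y = cy + rad * sin(ang))
}

# Uniform points in an axis-aligned rectangle.
runif_rect <- function(n, xmin, xmax, ymin, ymax) {
  cbind(x = runif(n, xmin, xmax), y = runif(n, ymin, ymax))
}

disc  <- runif_disc(500, cx = 0, cy = 0, r = 100)
rect  <- runif_rect(500, xmin = 150, xmax = 350, ymin = -50, ymax = 50)
# Outliers: a few uniform points over a frame that covers both shapes.
noise <- runif_rect(30, xmin = -150, xmax = 400, ymin = -150, ymax = 150)

# Join the pieces and keep track of where each point came from.
dat <- data.frame(rbind(disc, rect, noise),
                  shape = rep(c("disc", "rect", "outlier"), c(500, 500, 30)))

plot(dat$x, dat$y, col = as.integer(factor(dat$shape)), pch = 19, asp = 1)
```

Because the outliers are drawn over the whole frame, some of them will land inside a shape; filter those out if you need outliers that are strictly outside the clusters.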

Roman Luštrik
1

ELKI includes a flexible data generator that can produce various distributions in arbitrary dimensionality. It can also generate Gamma-distributed variables, for example.

There is documentation on the Wiki: http://elki.dbs.ifi.lmu.de/wiki/DataSetGenerator

Has QUIT--Anony-Mousse