
I am evaluating an algorithm, and would like to use artificial data.

The algorithm works fine for one-dimensional artificial datasets, as seen in this StackOverflow answer.

I would like to test the algorithm on datasets with more than one dimension and certain characteristics (e.g. noise, correlation). Has anyone already implemented an 'artificial dataset generator' in R?

Any feedback would be very much appreciated. Thanks!

cs0815

2 Answers


You could use the wakefield package to generate random data sets.

It makes it easy to create data frames and time series, adjust correlations, and even visualize the generated data, e.g.:

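# Install pacman if it is missing, then use it to load the other packages
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/wakefield")  # wakefield from GitHub
pacman::p_load(dplyr, tidyr, ggplot2)

set.seed(10)  # make the random data reproducible

# Build a 100-row data frame from wakefield's variable functions,
# sprinkle in missing values, and plot the column types
r_data_frame(n=100,
    id,
    dob,
    animal,
    grade, grade,
    death,
    dummy,
    grade_letter,
    gender,
    paragraph,
    sentence
) %>%
   r_na() %>%
   plot(palette = "Set1")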


epo3
  • That picture is not helpful without the actual code that generated it. I suggest you add the relevant information, or else this will go into the Very Low Quality answers queue. – David Arenburg Dec 31 '16 at 16:32
  • Will do, but that means duplicating the code from the author's manual. – epo3 Jan 01 '17 at 00:22

The mlbench package in R is a collection of functions for generating data of varying dimensionality and structure for benchmarking purposes. It includes both regression and classification data sets.

Of course, these data sets are all fairly artificial and so they may not really reflect "real life" performance, since they may not mirror the sort of structure that your algorithm is intended for. But it's a place to start, at least.
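As a rough illustration of what such generator calls look like, here is a minimal sketch using two mlbench generators, mlbench.friedman1 (regression) and mlbench.2dnormals (classification). The sample sizes, the seed, and the variable names reg and cls are arbitrary choices for this example, and it assumes mlbench is already installed:

```r
library(mlbench)

set.seed(42)  # arbitrary seed, only for reproducibility

# Regression: Friedman's benchmark problem 1.
# sd is the standard deviation of the noise added to the response.
reg <- mlbench.friedman1(n = 200, sd = 1)
dim(reg$x)     # input matrix: 200 rows, 10 columns
length(reg$y)  # one response per row

# Classification: Gaussian clusters in two dimensions.
# cl is the number of classes, sd the spread of each cluster.
cls <- mlbench.2dnormals(n = 200, cl = 3, sd = 0.5)
dim(cls$x)          # 200 points in 2 dimensions
table(cls$classes)  # class sizes
```

Varying sd across runs is one way to see at what noise level an algorithm starts to degrade.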

joran
  • Thanks. This seems to be a collection of datasets (I have used UCI before). I am more interested in a generator, so that I can see under which conditions (dataset properties) the algorithm's performance starts to crumble. Artificial datasets also allow me to measure calibration, something that is impossible (IMHO) using existing datasets where the 'truth' (formula) is unknown. Thanks. – cs0815 Jan 23 '12 at 16:42
  • @csetzkorn Look more closely. mlbench contains _generator_ functions with parameters that control things like sd, the centers of the cuboids, etc. Now, as I said, it is unlikely that someone _else_ will have magically created a function to generate artificial datasets in exactly the manner you'd like. If that's what you want, you'll have to code it up yourself. – joran Jan 23 '12 at 16:46
  • Sorry, I did not see the generator bit. Thanks. – cs0815 Jan 23 '12 at 17:10
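
For the "code it up yourself" route mentioned in the comments, one possible starting point is sketched below in base R. The function make_artificial_data and all of its parameters are hypothetical names invented for this example, not part of any package. It assumes multivariate normal inputs with a single common correlation rho and a linear 'truth' with Gaussian noise, so the true formula is known and the noise level and correlation can be varied independently:

```r
# Hypothetical helper, not from any package: correlated Gaussian inputs
# plus a response built from a known linear formula and Gaussian noise.
make_artificial_data <- function(n, p = 3, rho = 0.5,
                                 beta = rep(1, p), noise_sd = 1) {
  # Equicorrelation matrix: 1 on the diagonal, rho elsewhere.
  # It must be positive definite, i.e. -1/(p - 1) < rho < 1.
  sigma <- matrix(rho, nrow = p, ncol = p)
  diag(sigma) <- 1

  # chol() returns upper-triangular R with t(R) %*% R == sigma, so
  # multiplying independent normals by R gives covariance sigma.
  z <- matrix(rnorm(n * p), nrow = n, ncol = p)
  x <- z %*% chol(sigma)
  colnames(x) <- paste0("x", seq_len(p))

  # Known 'truth': y = x %*% beta, then add noise of chosen size
  y <- drop(x %*% beta) + rnorm(n, sd = noise_sd)
  data.frame(x, y = y)
}

set.seed(1)  # arbitrary seed, only for reproducibility
d <- make_artificial_data(n = 1000, p = 3, rho = 0.7, noise_sd = 0.5)
round(cor(d[, c("x1", "x2", "x3")]), 2)  # should be close to 0.7 off-diagonal
```

Sweeping rho and noise_sd over a grid and re-running the algorithm on each generated data set is one way to find the conditions under which its performance starts to break down.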