I need to test some imputation evaluation software I'm creating and am struggling to get benchmark datasets.

Does anyone know of a way to delete a certain amount of data from a data frame?

As an example of what I need:

You have a dataset and you want a random 20% of the rows to have a random number of the variables in that row removed (i.e. set to NA).

Or: Something that can turn

> head(mtcars,n=10)
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

Into:

> head(mtcars,n=10)
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           NA   6 160.0  NA 3.90 2.620    NA  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8  NA 108.0  93   NA    NA 18.61 NA  1   NA    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

I have tried a couple of methods that manipulate the columns, but they have fundamental flaws that render them useless.

This is my first ever question on here. If I have missed anything out or done something wrong, please do let me know.

All the best

abdnChap
  • Welcome to StackOverflow. Please have a look [here](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and try to give a minimal reproducible example that may help others to help you. – Ronak Shah Oct 05 '16 at 10:47
  • Please see http://stats.stackexchange.com/questions/184741/how-to-simulate-the-different-types-of-missing-data – m-dz Oct 05 '16 at 10:50
  • Hello Ronak, is that what you were thinking about in terms of reproducible example? m-dz, thank you for the link, I am looking through it now to see if I can use that. – abdnChap Oct 05 '16 at 11:06
  • @abdnChap yes, perfect! – Ronak Shah Oct 05 '16 at 11:12
  • I think you should also think about the mechanism of missingness in your data. Are you suggesting data missing at random, missing completely at random or missing not at random? – Wietze314 Oct 05 '16 at 11:24

1 Answer

This should do it:

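# apply() works on a matrix, which is fine here because every mtcars column is numeric
df_new <- as.data.frame(apply(mtcars, 2, function(x) {
    # set a random 20% of this column's values to NA
    x[sample(seq_along(x), round(length(x) * 0.2))] <- NA
    return(x)
}))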

`apply()` goes through the columns, and in each column `sample()` randomly selects 20% of the values to be set to NA.
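Because each column is sampled independently, the NAs fall in different rows from one column to the next, so the share of rows containing at least one NA ends up far above 20%. A quick check, as a sketch (the seed value is arbitrary and only there so the check is repeatable):

```r
set.seed(42)  # arbitrary seed, only to make the check repeatable

df_new <- as.data.frame(apply(mtcars, 2, function(x) {
    x[sample(seq_along(x), round(length(x) * 0.2))] <- NA
    x
}))

# Every column receives round(32 * 0.2) = 6 NAs
colSums(is.na(df_new))

# Proportion of rows with at least one NA; typically much larger than 0.2
mean(rowSums(is.na(df_new)) > 0)
```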

New answer after comment:

This picks a random 20% of the rows and sets a random number of values in each of them to NA.

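df <- mtcars
# pick 20% of the row indices at random
random_rows <- sample(seq_len(nrow(df)), round(nrow(df) * 0.2))
for (i_row in random_rows) {
    # inner sample() draws how many columns to blank, outer sample() draws which ones
    df[i_row, sample(seq_len(ncol(df)), sample(seq_len(ncol(df)), 1))] <- NA
}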
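For benchmarking it may be handy to wrap the loop in a function, so the fraction of rows can be varied. The following is only a sketch: `add_row_na`, `row_frac`, and `max_cols` are hypothetical names chosen for illustration and do not come from any package.

```r
# Hypothetical helper (name and arguments are illustrative only):
# set a random number of cells to NA in a random fraction of the rows.
add_row_na <- function(df, row_frac = 0.2, max_cols = ncol(df)) {
    n_rows <- round(nrow(df) * row_frac)
    # sample.int() draws distinct row indices without replacement
    rows <- sample.int(nrow(df), n_rows)
    for (i in rows) {
        n_cols <- sample.int(max_cols, 1)     # how many cells to blank: 1..max_cols
        cols <- sample.int(ncol(df), n_cols)  # which columns to blank in this row
        df[i, cols] <- NA
    }
    df
}

set.seed(42)  # arbitrary seed so the result can be reproduced
df_na <- add_row_na(mtcars, row_frac = 0.2)

# Number of rows that now contain at least one NA: round(32 * 0.2) = 6
sum(rowSums(is.na(df_na)) > 0)
```

With the default `max_cols = ncol(df)` a selected row can lose every value, as in the loop above. Passing a smaller `max_cols` is one way to guarantee that each affected row keeps some observed data for the imputation method to work with.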
tobiasegli_te
  • Hello! Thank you for your answer, but I don't think this will work for what I need. I already created a function that can remove a random percentage from all columns, but what I need is something that changes 10% of the rows. If you remove 10% randomly from each column, you change more than 10% of the rows, and if you set this to 40% or more, nearly all rows are changed. – abdnChap Oct 05 '16 at 11:13
  • So you mean that 10% of the rows contain exactly one NA or at least one NA? – tobiasegli_te Oct 05 '16 at 11:15
  • A random 10% of the rows should have a random amount of NA in them. – abdnChap Oct 05 '16 at 11:17
  • I'm just going to play with your suggestion for a little bit before accepting it, but by the looks of it, it does exactly what I needed. I wasn't aware that one could use sample to change the dataframe it came from. – abdnChap Oct 05 '16 at 11:26
  • Glad to help. By the way, if you want to make your results reproducible, add `set.seed(42)` before running `sample()`. If you re-run the code later, the random number generator will then draw the same random numbers. – tobiasegli_te Oct 05 '16 at 11:29
  • You're a star! It works great and the seed is great advice. Many thanks – abdnChap Oct 05 '16 at 11:31