Why do I get duplicate data when I create a random sample from an existing dataset within R?

Question

I want to understand what is wrong with my syntax when I sample from my dataset. It has 5000 rows, and I only want a random sample of 500 of them.

Reprex of the dataset (xdata):

 AccountId              Street         City State ZipCode CloseFactorPct OpenFactorPct ZipIncome ZipDegree
1    455697 3919 Birkdale Ln Se      Olympia    WA   98501           0.75         1.40       67060      0.17879866
2    490095      29174 Wagon Rd Agoura Hills    CA   91301           0.85         2.50      115125      0.21376952
3    427399   301a Franklin Ave    Princeton    NJ    8540           0.80         2.25      124954      0.50428200
4    470678  1461 Woodsview Way      Macedon    NY   14502           0.80         2.50       67780      0.13772373
5    424824      616 Locust Ave   Las Animas    CO   81054           0.80         2.25       31343      0.02021198
6    437343    13 New Oxford Rd       Conway    AR   72034           0.80         2.25       51435      0.15904222
  TotalOwed
1          0.0
2        185.1
3       1645.0
4          0.0
5          0.0
6          0.0

My code:

sample2 <- xdata[sample(nrow(xdata), "500", replace=T), sample(ncol(xdata), 10, replace=T)]

head(sample2)


   ZipIncome         City ZipIncome.1 TotalOwed                       Street OpenFactorPct ZipHhIncome.2
14470       41866     Columbus         41866       841.31          792 Dennison Avenue         0.85         41866
23502       55221      El Paso         55221         0.00 12949 Eastbrook Drive Apt 53         0.70         55221
7370        93373 Saddle Brook         93373       570.38              229 S Boulevard         0.70         93373
31627       61830    Choudrant         61830      1156.28             153 Jones Street         0.70         61830
29840       39697      Beckley         39697         0.00            2109 S Kanawha St         0.75         39697
14938       91313    Bradenton         91313         0.00               5007 Serata Dr         0.85         91313
      ZipIncome.3 ClosedFactorPct ZipIncome.4
14470         41866           0.95         41866
23502         55221           0.80         55221
7370          93373           1.20         93373
31627         61830           0.80         61830
29840         39697           0.80         39697
14938         91313           1.30         91313

The output I receive gives me 4 duplicates of zipincome. Why does this happen? Can someone help me understand whether my syntax for pulling a random sample is incorrect, or whether I need to call set.seed()?

You may set the argument `replace` to `FALSE`, which is the default. Don't forget `set.seed(42)` for reproducibility. — markus, Jul 10 '19 at 14:06
When I do that, the same thing happens, but with zipcode. @markus If you just wanted to pick a random sample from a dataset, what would you change in my code, or what would your code be? — NewInPython, Jul 10 '19 at 14:10
Try `set.seed(42); xdata[sample(nrow(xdata), 500), sample(ncol(xdata), 10)]` — markus, Jul 10 '19 at 14:11
@markus could you kindly explain the syntax, and why `set.seed(42)` rather than `set.seed(500)`? — NewInPython, Jul 10 '19 at 14:13
[The number of the seed doesn't matter](https://stackoverflow.com/questions/13605271/reasons-for-using-the-set-seed-function). `replace = FALSE` simply means you sample without replacement. — markus, Jul 10 '19 at 14:16
For the 42, see https://en.wikipedia.org/wiki/Phrases_from_The_Hitchhiker%27s_Guide_to_the_Galaxy#Answer_to_the_Ultimate_Question_of_Life,_the_Universe,_and_Everything_(42) — markus, Jul 10 '19 at 14:22
If there are duplicate `zipincome` values in your data, then when you do a random sample there is a possibility that your sample will also include duplicate `zipincome` values, even if you sample without replacement. — Gregor Thomas, Jul 10 '19 at 14:29
Thank you @markus. Is the 10 at the end of the syntax only there because the number of columns is 10? — NewInPython, Jul 10 '19 at 14:30
Your code suggests that you want to sample 500 rows and 10 columns from your data. If you only want to sample from the rows, try `set.seed(42); xdata[sample(1:nrow(xdata), 500), ]`. Is this what you need? — markus, Jul 10 '19 at 14:33
No, not just rows, the column headers too. Your code is correct. I wish I could mark it as correct. — NewInPython, Jul 10 '19 at 14:40
I agree with @markus: you have effectively sampled 10 columns randomly with `replace = T`. Remove the bits about sampling columns and see what happens. If you want the sample to include all columns, just use the `nrow` bit. — bob1, Jul 10 '19 at 17:06
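
To make the comments above concrete, here is a minimal sketch of what seems to be going on. It assumes a made-up stand-in for `xdata` (same 10 column names as in the question, but invented values), so it only illustrates the mechanism and is not the asker's actual data. Sampling column positions with `replace = TRUE` can pick the same position more than once, and data frame subsetting then renames the repeats with `.1`, `.2`, ... suffixes, which would explain headers like `ZipIncome.1`.

```r
set.seed(42)  # any seed works; it only makes the draws reproducible

# Hypothetical stand-in for xdata: 5000 rows, 10 columns, invented values.
n <- 5000
xdata <- data.frame(
  AccountId      = seq_len(n),
  Street         = paste(seq_len(n), "Main St"),
  City           = "Anytown",
  State          = "WA",
  ZipCode        = 98501,
  CloseFactorPct = 0.80,
  OpenFactorPct  = 2.25,
  ZipIncome      = round(runif(n, 30000, 125000)),
  ZipDegree      = runif(n),
  TotalOwed      = 0
)

# Drawing 10 column positions out of 10 WITH replacement is very likely
# to repeat at least one position.
col_idx <- sample(ncol(xdata), 10, replace = TRUE)
anyDuplicated(col_idx) > 0

# Selecting the same column several times makes the repeated names unique.
dup_cols <- xdata[1:3, c(8, 8, 8)]
names(dup_cols)  # "ZipIncome" "ZipIncome.1" "ZipIncome.2"

# Sample 500 distinct rows and keep every column exactly once.
# sample() draws without replacement by default.
sample_rows <- xdata[sample(nrow(xdata), 500), ]
dim(sample_rows)  # 500 10
```

If a random subset of columns is also wanted, `sample(ncol(xdata), k)` without `replace = TRUE` should return each chosen column at most once, as long as `k` is no larger than the number of columns.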
