How to get random subset of rows of data frame

Question

I have a data.frame with 10 cols and about 700K rows.
I want to use the pairs(data.frame) function to show a pairwise scatterplot of the column values. It is not necessary (or feasible) to plot all 700K rows in each plot so I'd like to select a random subset of say 2 or 3K (some small number) of rows to be plotted.

Can someone please suggest how to select a small random subset of my data frame? I think that either

a random subset of X% of the data.frame or
every Nth row would work.
I know I've seen this done, but I can't locate the code snippet.

thanks

score 3 · Answer 1 · answered Nov 08 '13 at 15:33

The important question is: will a random subset of your rows accurately describe the entire dataset?
Until we understand what your data represent (time sequences vs. random samplings, or something else), it's difficult to provide proper advice as to the right subset to plot.

Would you be better off, e.g., creating a function via splinefun for each column and generating a plot of fitted data at uniform spacings from min to max?
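
As a rough sketch of the spline idea, assuming the row index can stand in for time: the data frame `df`, its column names, and the grid size of 500 below are all made up for illustration. Note that `splinefun` interpolates exactly through every point, so this only resamples each column onto an evenly spaced grid; it does not smooth noisy data.

```r
# Hypothetical stand-in for the real data: a small time-ordered data frame.
set.seed(1)
df <- data.frame(a = cumsum(rnorm(5000)), b = cumsum(rnorm(5000)))

# Assumption: the row index is the x variable. A real time column would be better.
x <- seq_len(nrow(df))

# Uniformly spaced evaluation points from min to max of x.
grid <- seq(min(x), max(x), length.out = 500)

# Build one interpolating spline function per column and evaluate it on the grid.
fitted <- as.data.frame(lapply(df, function(col) splinefun(x, col)(grid)))

# Pairwise scatterplot of the resampled columns.
pairs(fitted)
```

This keeps the time ordering of the series, which a purely random sample of rows would not.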

Carl Witthoft

+1, Normally I would point out that this is a comment not an answer, but it is a great point and worthy of an upvote – Ricardo Saporta Nov 08 '13 at 15:36
@RicardoSaporta thanks -- I was not sure which way to post; went w/ "answer" because I have high hopes that a spline fit will improve the final product. – Carl Witthoft Nov 08 '13 at 16:18
Good point. The data is time series data so in this case a random sample might not provide an accurate picture of the data. – Robert Lewkovich Nov 08 '13 at 17:26

score 1 · Answer 2 · answered Nov 08 '13 at 14:50

Would something like this work?

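a <- sample(nrow(df), 2000) # option 1: 2000 row indices drawn at random, without replacement
a <- seq(1, nrow(df), by = 200) # option 2: every 200th row (about 3500 rows out of 700K)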

Then the subset can be obtained thus -

randomsubset <- df[a,]
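
Putting the pieces together, here is a minimal end-to-end sketch covering the three variants mentioned in the question (a fixed number of rows, X% of the rows, and every Nth row). The data frame is a made-up stand-in with 70,000 rows, and the sizes 2000, 1%, and 200 are arbitrary choices for illustration.

```r
# Hypothetical stand-in for the real data: 70,000 rows, 10 numeric columns.
set.seed(42)  # makes the random draw reproducible
df <- data.frame(matrix(rnorm(70000 * 10), ncol = 10))

# Fixed number of rows: sample() draws without replacement by default.
idx <- sample(nrow(df), 2000)
random_subset <- df[idx, ]

# X% of the rows (here 1%).
pct_subset <- df[sample(nrow(df), round(0.01 * nrow(df))), ]

# Every Nth row (here every 200th).
nth_subset <- df[seq(1, nrow(df), by = 200), ]

# Pairwise scatterplot of the random subset.
pairs(random_subset)
```

Indexing with `nrow(df)` rather than a hard-coded 700000 means the same lines keep working if the size of the data frame changes.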

TheComeOnMan

