I have a data.frame with 10 cols and about 700K rows.
I want to use the pairs(data.frame) function to show pairwise scatterplots of the column values. Plotting all 700K rows in each panel is neither necessary nor feasible, so I'd like to plot a random subset of some small number of rows, say 2,000 to 3,000.

Can someone please help me with options for selecting a small random subset of my data frame? I think that either

  1. a random subset of X% of the data.frame or
  2. every Nth row would work.

I know I've seen this done but can't locate the code snippet.

thanks

digEmAll
Robert Lewkovich

2 Answers

The important question is: will a random subset of your rows accurately describe the entire dataset?
Until we understand what your data represent (time sequences, random samplings, or something else), it's difficult to advise on the right subset to plot.

Would you be better off, e.g., creating a function via splinefun for each column and generating a plot of fitted data at uniform spacings from min to max?
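
A rough sketch of that idea, in case it helps. It assumes the rows share a numeric index (called `time_idx` here) and uses made-up stand-in data; the column names, the 5,000-row size and the 500-point grid are all hypothetical. Each column gets its own `splinefun` interpolant, which is then evaluated at uniformly spaced points between the min and max of the index.

```r
set.seed(1)  # only so the sketch is reproducible

# Hypothetical stand-in data: an index plus three noisy series
n <- 5000
time_idx <- sort(runif(n, 0, 100))
df <- data.frame(a = sin(time_idx / 10) + rnorm(n, sd = 0.1),
                 b = cos(time_idx / 15) + rnorm(n, sd = 0.1),
                 c = time_idx / 50 + rnorm(n, sd = 0.1))

# Uniformly spaced evaluation points from min to max of the index
grid <- seq(min(time_idx), max(time_idx), length.out = 500)

# One interpolating spline function per column, evaluated on the grid
resampled <- as.data.frame(
  lapply(df, function(col) splinefun(time_idx, col)(grid))
)

# Pairwise scatterplots of the resampled values
pairs(resampled, pch = ".")
```

Note that `splinefun` interpolates, so it passes through every observation and will follow the noise. If the goal is a smoothed picture, something like `smooth.spline` might be a better fit, but that is a different technique from the one suggested above.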

Carl Witthoft
  • +1, Normally I would point out that this is a comment not an answer, but it is a great point and worthy of an upvote – Ricardo Saporta Nov 08 '13 at 15:36
  • @RicardoSaporta thanks -- I was not sure which way to post; went w/ "answer" because I have high hopes that a spline fit will improve the final product. – Carl Witthoft Nov 08 '13 at 16:18
  • Good point. The data is time series data so in this case a random sample might not provide an accurate picture of the data. – Robert Lewkovich Nov 08 '13 at 17:26

Would something like this work?

a <- sample(nrow(df), 2000)        # option 1: 2,000 random rows, without replacement
a <- seq(1, nrow(df), by = 300)    # option 2: every 300th row (about 2,300 rows out of 700K)

Then the subset can be obtained like this:

randomsubset <- df[a, ]
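
For completeness, here is a self-contained sketch covering both options from the question and the call to `pairs`. The data frame is a random stand-in for the real one, and the names `n_keep`, `pct` and `step` are just illustrative; adjust them to taste.

```r
set.seed(42)  # only so the sketch is reproducible

# Stand-in for the real data (assumption): 10 numeric columns, 700K rows
df <- as.data.frame(matrix(rnorm(700000 * 10), ncol = 10))

n_keep <- 2000  # target size of the subset

# Option 1: simple random sample of rows, without replacement
idx_random <- sample(nrow(df), n_keep)

# Option 1 again, expressed as a percentage of the rows (0.3% here)
pct <- 0.3
idx_pct <- sample(nrow(df), round(nrow(df) * pct / 100))

# Option 2: every Nth row, with N chosen to give roughly n_keep rows
step <- floor(nrow(df) / n_keep)
idx_nth <- seq(1, nrow(df), by = step)

# Sorting the random indices keeps the rows in their original order
sub_random <- df[sort(idx_random), ]
sub_nth <- df[idx_nth, ]

# Pairwise scatterplots of the sampled rows; pch = "." keeps points small
pairs(sub_random, pch = ".")
```

Every-Nth-row sampling can line up with any periodicity in the data, so the random version is usually the safer default unless the row order itself matters.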
TheComeOnMan