
I have an input table with more than 750K rows. It has a field called quarter. I want to create a sample such that I get 10% of the records from each quarter. The main attributes of the data.frame are:

  1. "SERIAL_NBR"
  2. "MODELNO"
  3. "War.Start.Monthly"

"Warranty.Start.Qua.Yr" is the field where the quarter is stored. Is there any way to generate sample data that contains 10% of the records for each quarter?

Using the sample function I can get a sample regardless of the quarter. The code for that is:

raw_claim_input[sample(1:nrow(raw_claim_input),as.integer(nrow(raw_claim_input)/10)),]

When I do the following for one value, I do not get the expected results, because there is a logical problem in how the rows are selected:

raw_claim_input[sample(1:nrow(raw_claim_input[raw_claim_input$War.Start.Monthly=="08-M2",]),as.integer(nrow(raw_claim_input[raw_claim_input$War.Start.Monthly=="08-M2",])/10)),]

The value 08-M2 is the filter; I want to do this for all available values. There are 70 values of War.Start.Monthly, and I want to generate a sample for each of them.
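
The attempt above draws positions from 1 to n, where n is the size of the filtered subset, but then uses those positions to index the full data frame, so the rows returned need not belong to the chosen value. A minimal sketch of one way around this, assuming a small made-up data frame `toy` (hypothetical, standing in for `raw_claim_input`), is to sample from the matching row positions themselves:

```r
set.seed(1)  # only to make the sketch repeatable

# Hypothetical toy data standing in for raw_claim_input.
toy <- data.frame(
  SERIAL_NBR = 1:200,
  War.Start.Monthly = rep(c("08-M2", "08-M3"), times = c(50, 150)),
  stringsAsFactors = FALSE
)

# Positions in the full data frame of the rows matching the filter value.
idx <- which(toy$War.Start.Monthly == "08-M2")

# Take 10% of those positions. Indexing idx with sample.int() avoids
# sample()'s special case when idx has length one.
n_take <- floor(length(idx) / 10)
picked <- idx[sample.int(length(idx), n_take)]

one_month_sample <- toy[picked, ]
```

Here every selected row comes from the 08-M2 group, because the sampled positions are taken from `idx` rather than from `1:n`.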

Part of the data:

     Day.Covered           SHIP_DATE Warranty.Start.Qua.Yr War.Start.Monthly AssemblyDateUpdated Warranty.End.Date Warranty.End.Qur.Yr War.End.Monthly
252754         365    06-04-2008 00:00                 08-Q2             08-M6    06-03-2008 00:00        08-04-2064               64-Q2           64-M4
441605        1095 08-17-2010 11:13:07                 10-Q3             10-M8 08-16-2010 12:09:57        08-04-2064               64-Q2           64-M4
583636         731 10-17-2012 00:00:00                 12-Q4            12-M10 10-16-2012 00:00:00        08-04-2064               64-Q2           64-M4
115586         731    01-04-2013 00:00                 13-Q1             13-M1    01-03-2013 00:00        08-04-2064               64-Q2           64-M4
334221        1095 06-13-2011 12:29:23                 11-Q2             11-M6    06-11-2011 11:25        08-04-2064               64-Q2           64-M4
146656        1095 03-16-2011 10:54:37                 11-Q1             11-M3 03-15-2011 08:14:40        08-04-2064               64-Q2           64-M4
249956        1095 06-18-2008 12:35:06                 08-Q2             08-M6    06-06-2008 10:51        08-04-2064               64-Q2           64-M4
276295         731 05-18-2011 00:00:00                 11-Q2             11-M5 05-18-2011 00:00:00        19-11-2014               14-Q4          14-M11
582423         731 10-22-2012 00:00:00                 12-Q4            12-M10 10-22-2012 00:00:00        08-04-2064               64-Q2           64-M4
380369         730    08-04-2009 17:43                 09-Q3             09-M7 07-31-2009 07:14:17        18-01-2012               12-Q1           12-M1

Please let me know if more details are needed.

Johan
vrajs5
  • Take a look at the `sample` function and read about the `prob` argument, which allows you to set probabilities for each number. – Jilber Urbina Oct 29 '13 at 12:05
  • @Jilber - The number of records in each quarter is not the same. Also, if I am not wrong, prob is where one assigns the probability of each element being selected, not of a category like quarter in my case. – vrajs5 Oct 29 '13 at 12:15
  • @vrajs5 the willingness to help quickly evaporates without a reproducible example. Please provide one and save those who can answer your question from going back and forth over details which you should supply at the outset. Please read [**how to make a great reproducible example**](http://stackoverflow.com/q/5963269/1478381) and update your question accordingly. – Simon O'Hanlon Oct 29 '13 at 12:20
  • I agree with @SimonO101. Please also show us what you have tried. [Questions asking for code must include attempted solutions, why they didn't work, and the expected results.](http://stackoverflow.com/help/on-topic) – Henrik Oct 29 '13 at 12:23
  • @SimonO101 - Added lines of code here. I hope this might be helpful. – vrajs5 Oct 29 '13 at 12:40
  • @vrajs5 a sample of the data will be the *most* helpful thing you can add. Try adding the output from `dput( head( raw_claim_input , 10 ) )` It looks nonsensical, but we can copy that straight into our R session to recreate the first 10 rows of your data.frame. – Simon O'Hanlon Oct 29 '13 at 12:46
  • I have added 10 random records of the data. Sorry, dput is generating a 23 MB chunk of data, so I have pasted a limited number of records. – vrajs5 Oct 29 '13 at 13:00
  • @vrajs5: I was wondering if my answer did actually help you? – fotNelton Oct 30 '13 at 04:53

1 Answer


This will do:

X <- read.csv(text="Day.Covered,SHIP_DATE,Warranty.Start.Qua.Yr,War.Start.Monthly,AssemblyDateUpdated,Warranty.End.Date,Warranty.End.Qur.Yr,War.End.Monthly
 365,    06-04-2008 00:00, 08-Q2,  08-M6,    06-03-2008 00:00, 08-04-2064 ,64-Q2,  64-M4
1095, 08-17-2010 11:13:07, 10-Q3,  10-M8, 08-16-2010 12:09:57, 08-04-2064 ,64-Q2,  64-M4
 731, 10-17-2012 00:00:00, 12-Q4, 12-M10, 10-16-2012 00:00:00, 08-04-2064 ,64-Q2,  64-M4
 731,    01-04-2013 00:00, 13-Q1,  13-M1,    01-03-2013 00:00, 08-04-2064 ,64-Q2,  64-M4
1095, 06-13-2011 12:29:23, 11-Q2,  11-M6,    06-11-2011 11:25, 08-04-2064 ,64-Q2,  64-M4
1095, 03-16-2011 10:54:37, 11-Q1,  11-M3, 03-15-2011 08:14:40, 08-04-2064 ,64-Q2,  64-M4
1095, 06-18-2008 12:35:06, 08-Q2,  08-M6,    06-06-2008 10:51, 08-04-2064 ,64-Q2,  64-M4
 731, 05-18-2011 00:00:00, 11-Q2,  11-M5, 05-18-2011 00:00:00, 19-11-2014 ,14-Q4, 14-M11
 731, 10-22-2012 00:00:00, 12-Q4, 12-M10, 10-22-2012 00:00:00, 08-04-2064 ,64-Q2,  64-M4
 730,    08-04-2009 17:43, 09-Q3,  09-M7, 07-31-2009 07:14:17, 18-01-2012 ,12-Q1,  12-M1")

# Replicate X to have enough data for this example.
X <- X[rep(seq(nrow(X)), 100),]

# Partition the data according to quarter.
partitions <- split(X, X$Warranty.Start.Qua.Yr)
# Draw samples from each partition.
samples <- lapply(partitions, function(p) p[sample(nrow(p), floor(nrow(p) / 10)), ])
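
The code above leaves the result as a list with one data frame per quarter. If a single sampled data frame is wanted, the pieces can be stacked afterwards. The following is only a sketch on made-up data: `toy`, `picked`, `stratified`, and the two count tables are hypothetical names, and `do.call(rbind, ...)` is one of several ways to combine the list.

```r
set.seed(42)  # only to make the sketch repeatable

# Hypothetical toy data; the quarter column name mirrors the question.
toy <- data.frame(
  SERIAL_NBR = 1:1000,
  Warranty.Start.Qua.Yr = sample(c("08-Q2", "10-Q3", "12-Q4"), 1000, replace = TRUE),
  stringsAsFactors = FALSE
)

# Sample roughly 10% of the rows within each quarter.
parts <- split(toy, toy$Warranty.Start.Qua.Yr)
picked <- lapply(parts, function(p) {
  p[sample.int(nrow(p), floor(nrow(p) / 10)), , drop = FALSE]
})

# Stack the per-quarter samples into one data frame.
stratified <- do.call(rbind, picked)

# Per-quarter row counts in the full data and in the sample, for comparison.
counts_full <- table(toy$Warranty.Start.Qua.Yr)
counts_sample <- table(stratified$Warranty.Start.Qua.Yr)
```

Because `floor()` rounds down, a quarter with fewer than 10 rows contributes no rows to the sample; `ceiling()` could be used instead if every quarter should be represented.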
fotNelton