restrict the number of rows for a particular string in a data frame to some number n in R

Question

I have a data frame(data1) whose structure is:

   Time-period   income
   Jan.2000      15000
   Jan.2000      23000
   Jan.2000      17400
   Jan.2000      11111
         .         .
         .         .
   Feb.2000      37000
   Feb.2001      39000
   Mar.2001      39000
   Mar.2001      39000
         .           .
         .           .

Thus for each month and year there are nentries in the data frame data1. What I want is that for each month of each year there should be only atmost 2(top 2 if more than 2 are there) entries. Thus for Jan. 2000 I want to have only top 2 entries and remove rest of the entries from data frame data1. Similarly for every month of every year wherever there are more than two entries. How do I do that?

UPDATE: Here is my sample data set:

   Time-period   income
   Jan.2000      15000
   Jan.2000      23000
   Jan.2000      17400
   Feb.2000      37000
   Feb.2000      39000
   Mar.2000      39000
   Mar.2000      39000
   Mar.2000      39500
   Jan.2000      11111
   Apr.2000      39000

After filtering I should get:

       Time-period   income
       Jan.2000      15000
       Jan.2000      23000
       Feb.2000      37000
       Feb.2000      39000
       Mar.2000      39000
       Mar.2000      39000
       Apr.2000      39000

Notice that how last 2 entries of Jan. 2000(one of which is after Mar. 2000 so time-period is not ordered) and last entry of Mar. 2000 got removed but for Feb. 2000 and Apr 2000 no deletion happened as there number of entries was <= 2

I'm guessing by top 2, you mean the two largests incomes. It would be more helpful if you could create a sample data set we can copy/paste into R and also explicitly give the desired output for that sample input in your question. See [how to make a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for tips and suggestions on how to do just that. — MrFlick, Aug 13 '14 at 19:33
@MrFlick by top 2 I meant first 2 entries. The income column is of no importance for filtering. I have put a sample data in my post above. — user2966197, Aug 13 '14 at 19:41
You sample data is still a bit confusing. What class is your time-period column? Is it a character? date? factor? Is it properly ordered? Plus the space in the middle of the value makes it annoying to try to import into R. I'm guessing something like `do.call(rbind, by(dd, dd$Time.period, head ,2))` would work but you've made it difficult to test. — MrFlick, Aug 13 '14 at 19:47
@MrFlick the `time-period` is of character type and it is not ordered.So entries for `Jan. 2000` are not together and may be spread out. — user2966197, Aug 13 '14 at 19:51
@MrFlick which space are you talking about? the data set is in update section in above post — user2966197, Aug 13 '14 at 20:17
The space in "Jan. 2000". When you using `read.table` on a value like that, R will interpret it as two columns but yet you've provided a single header without a space. It's just annoying to have to edit all the rows in order to be able to import the data into R to test different solutions. — MrFlick, Aug 13 '14 at 20:21
@MrFlick I apologize for that but this is the format in which data is present in my data frame. I will try to edit the post and manually delete the space for you. — user2966197, Aug 13 '14 at 20:33

score 1 · Accepted Answer · answered Aug 13 '14 at 20:45

1

So with the sample data.frame

dd<-data.frame(Time.period = c("Jan.2000", "Jan.2000", "Jan.2000", 
"Feb.2000", "Feb.2000", "Mar.2000", "Mar.2000", "Mar.2000", "Jan.2000", 
"Apr.2000"), income = c(15000, 23000, 17400, 37000, 39000, 
39000, 39000, 39500, 11111, 39000))

You can run the code i suggested in the comments with a slight adjustment

do.call(rbind, by(dd, factor(dd$Time.period, levels=unique(dd$Time.period)),
    head ,2))

This will return

           Time.period income
Jan.2000.1    Jan.2000  15000
Jan.2000.2    Jan.2000  23000
Feb.2000.4    Feb.2000  37000
Feb.2000.5    Feb.2000  39000
Mar.2000.6    Mar.2000  39000
Mar.2000.7    Mar.2000  39000
Apr.2000      Apr.2000  39000

it does create odd row.names but you can get rid of those if you like with something like rownames(x)<-NULL depending on where you save the result.

answered Aug 13 '14 at 20:45

MrFlick

195,160
17
277
295

is it possible to make this code not to reorder data based on `Time.period` alphabetically? what it does is reorder the entries so `Oct.1999` is coming after `Apr. 2000` but I do not want it to reorder. Actually my data starts from `Oct. 1999` and goes till `Dec. 2001` and I want this order not be changed – user2966197 Aug 13 '14 at 21:05
The code I posted here does not reorder. Can you provide a more representative sample data set. You said time.period was character right? Not a factor? – MrFlick Aug 13 '14 at 21:07
Sorry for the error.It's executing correctly. By mistake I had not put factor and hence it was reodering it. Thanks for help! – user2966197 Aug 13 '14 at 21:23
If you needed to re-order the factor again based on appearance of values, you could do `factor(dd$Time.period, levels=unique(as.character(dd$Time.period)))` in the code above. – MrFlick Aug 13 '14 at 22:22

restrict the number of rows for a particular string in a data frame to some number n in R

1 Answers1