0

I have a dataset of patents, where I have recorded 1) the month and year associated with a patent renewal and 2) whether the patent holder chose to pay the patent fee or let the patent lapse. So

patentid        fee1date         fee1paid    fee2date    fee2paid
1               May 2010       True        May 2013    False 
2               May 2010       True        April 2014  True

What I want to do is develop a count of the number of renewals by month and by year, as well as the number of abandoned patents, as follows:

date      renewed              lapsed
May 2010  2                   0

How might I count the data that I have now? Thank you!

EDIT: The key point is to aggregate these across different columns. The issue that I am running into now is that when I try using the count library, it treats 2 renewals in May 2010 as two separate values.

Krabo
  • 31
  • 6
  • Minor advice - try to avoid capitals and spaces in your column names – tjebo Jul 16 '18 at 13:12
  • And - how do you get to `No Patents Renewed 11` from your example data frame? This looks like magic to me. Please explain! – tjebo Jul 16 '18 at 13:13
  • 1
    Upper case is ok, but spaces cause trouble – s_baldur Jul 16 '18 at 13:14
  • I agree, but upper case can be pain for typing. :) – tjebo Jul 16 '18 at 13:15
  • Sorry, it was a dummy value that I included :P – Krabo Jul 16 '18 at 13:16
  • how u realize that patent abandoned in 2010 ? you just know that abandon time is between 2010 and 2013 ! – Saman Jul 16 '18 at 13:18
  • Sorry SaMi GiMiX, I don't quite understand your comment. Most patent offices release information about filing fees online, so it is possible to work backwards to see whether a patent has been renewed or not. – Krabo Jul 16 '18 at 13:20
  • It seems that `lapsed` should be 1 because the `fee2paid` for line 1 is FALSE. Am I missing something? – akash87 Jul 16 '18 at 13:44
  • No, because I am looking at the number of renewal statistics by month. Because there is no lapse in May 2010, the count for that month is 0. – Krabo Jul 16 '18 at 13:50

1 Answers1

4

Using dplyr

require(tidyr)
require(dplyr)      

data %>% gather(year,value, -Patent.ID) %>% 
         separate('year',c('Fee','N','Act')) %>% 
          spread(Act,value) %>% 
          unite(Fee, Fee,N, sep = '.') %>% 
          group_by(Date) %>% 
          summarise(R=sum(Paid=='True'), NotR=sum(Paid=='False'))

     # A tibble: 3 x 3
  Date           R  NotR
  <chr>      <int> <int>
1 April 2014     1     0
2 May 2010       2     0
3 May 2013       0     1

Data

data <- read.table(text="
               'Patent ID'      'Fee 1 Date'   'Fee 1 Paid'    'Fee 2 Date'   'Fee 2 Paid'
               1               'May 2010'       True        'May 2013'    False 
               2               'May 2010'       True        'April 2014'  True

               ",header=T, stringsAsFactors = F)
tjebo
  • 21,977
  • 7
  • 58
  • 94
A. Suliman
  • 12,923
  • 5
  • 24
  • 37
  • Hi A. Suliman, thanks for the help! I've tried your approach but unfortunately I get the following error message: `In addition: Warning message: attributes are not identical across measure variables; they will be dropped` – Krabo Jul 16 '18 at 14:32
  • Check https://stackoverflow.com/questions/28972386/retain-attributes-when-using-gather-from-tidyr-attributes-are-not-identical – A. Suliman Jul 16 '18 at 14:48
  • Eventually I got there, I used the plyr count and aggregate function to add each of the columns together. Thanks for your help though! – Krabo Jul 16 '18 at 22:55