I have read about all Q&A on rolling sums on this website but I can't make sense of most of the complex code so my tweaking skills are limited. I tried implementing a few solutions proposed, here, here, and here amongst others but either I get errors or my computer crashes, even when I only use 1,000 rows and 3 columns. Hence clearly, I mess up the code.
My data looks like this (first 50 rows via dput). Total dataset is about 100,000 rows
structure(list(pnum = c("4778744", "4778744", "4778744", "4832724",
"4840655", "4854957", "4952026", "4832724", "4832724", "4840655",
"4952026", "4854957", "4952026", "4979975", "5062877", "5062877",
"4979975", "4979975", "4979975", "5093287", "5148510", "5093287",
"5148510", "5093287", "5148510", "5093287", "5148510", "5093287",
"5148510", "5093287", "5148510", "5093287", "5148510", "5212120",
"5375012", "5168079", "5375012", "5212120", "5212120", "5168079",
"4811345", "4851990", "4947366", "5142672", "5317715", "4878166",
"4851990", "5142672", "5317715", "4878166", "5142672", "5317715",
"4878166", "5142672", "5317715", "4878166", "5142672", "5317715",
"4878166", "5185878", "4926323", "4926323", "4926323", "4926323",
"5185878", "4926323", "4926323", "4926323", "4926323", "4926323",
"4926323", "5129067", "5136697", "5210841", "5237700", "5237700",
"5237700", "5247644", "5805912", "5828869", "5357626", "5247644",
"5805912", "5828869", "5357626"), ID = c("03859643-1", "04488864-4",
"04560399-1", "03859643-1", "03859643-1", "03859643-1", "03859643-1",
"03901719-2", "04086089-2", "04086089-2", "04407934-2", "04488864-4",
"04952026-3", "03859643-1", "03859643-1", "03901719-2", "03912481-3",
"03940277-1", "04979975-2", "03859643-1", "03859643-1", "03864113-1",
"03864113-1", "04877300-1", "04877300-1", "04877300-3", "04877300-3",
"05040862-3", "05040862-3", "05093287-4", "05093287-4", "05093287-6",
"05093287-6", "03859643-1", "03859643-1", "03859643-1", "03870399-2",
"03901719-2", "03923529-1", "04784976-1", "03860454-2", "03860454-2",
"03860454-2", "03860454-2", "03860454-2", "03860454-2", "04761567-2",
"04870622-2", "04870622-2", "04870622-2", "04878166-2", "04878166-2",
"04878166-2", "04878166-3", "04878166-3", "04878166-3", "04878166-5",
"04878166-5", "04878166-5", "03860454-2", "03860454-2", "04610004-1",
"04734852-2", "04734852-3", "04761567-2", "04761567-2", "04777587-1",
"04835414-1", "04878166-2", "04926323-10", "04926323-5", "03860454-2",
"03860454-2", "03860454-2", "03860454-2", "05237700-2", "05237700-3",
"03860454-2", "03860454-2", "03860454-2", "03860454-2", "04731737-1",
"04731737-1", "04731737-1", "04731737-1"), Time = c(1986L, 1986L,
1986L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L,
1988L, 1988L, 1989L, 1989L, 1989L, 1989L, 1989L, 1989L, 1990L,
1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L,
1990L, 1990L, 1990L, 1990L, 1991L, 1991L, 1991L, 1991L, 1991L,
1991L, 1991L, 1986L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L,
1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L,
1987L, 1987L, 1987L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L,
1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1989L, 1989L, 1990L,
1990L, 1990L, 1990L, 1991L, 1991L, 1991L, 1991L, 1991L, 1991L,
1991L, 1991L)), .Names = c("pnum", "inventor", "pryear"), row.names = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L,
16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L,
29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 325L,
326L, 327L, 328L, 329L, 330L, 331L, 332L, 333L, 334L, 335L, 336L,
337L, 338L, 339L, 340L, 341L, 342L, 343L, 344L, 345L, 346L, 347L,
348L, 349L, 350L, 351L, 352L, 353L, 354L, 355L, 356L, 357L, 358L,
359L, 360L, 361L, 362L, 363L, 364L, 365L, 366L, 367L, 368L, 369L
), class = "data.frame")
Multiple inventors
collaborate on a project pnum
in a specific year called pryear
. I am looking for three things:
After comments from @Thierry I changed the data sample to ensure that the problem he pointed out was dealt with.
- The number of projects conducted by each individual inventors in an x (say 3) year window before the current
pryear
, thus if year of current project is 1977, I want the number of projects conducted from 1974 until 1976 included. If there are no occurrences before, ideally the result would be '0'. the answer provided by @Alex here can be used to achieve this first goal. But as discussed in the comments, it is not highly efficient (especially as my time range is from 1952 to 2010 with over 50,000 inventors). - The total number of different inventors with whom each inventor has worked in that same time window
- If a project has multiple inventors, I am looking for the number of times each inventor has collaborated with the other inventors who are working on the current project during the same past time window