1

I have read about all Q&A on rolling sums on this website but I can't make sense of most of the complex code so my tweaking skills are limited. I tried implementing a few solutions proposed, here, here, and here amongst others but either I get errors or my computer crashes, even when I only use 1,000 rows and 3 columns. Hence clearly, I mess up the code.

My data looks like this (first 50 rows via dput). Total dataset is about 100,000 rows

           structure(list(pnum = c("4778744", "4778744", "4778744", "4832724", 
"4840655", "4854957", "4952026", "4832724", "4832724", "4840655", 
"4952026", "4854957", "4952026", "4979975", "5062877", "5062877", 
"4979975", "4979975", "4979975", "5093287", "5148510", "5093287", 
"5148510", "5093287", "5148510", "5093287", "5148510", "5093287", 
"5148510", "5093287", "5148510", "5093287", "5148510", "5212120", 
"5375012", "5168079", "5375012", "5212120", "5212120", "5168079", 
"4811345", "4851990", "4947366", "5142672", "5317715", "4878166", 
"4851990", "5142672", "5317715", "4878166", "5142672", "5317715", 
"4878166", "5142672", "5317715", "4878166", "5142672", "5317715", 
"4878166", "5185878", "4926323", "4926323", "4926323", "4926323", 
"5185878", "4926323", "4926323", "4926323", "4926323", "4926323", 
"4926323", "5129067", "5136697", "5210841", "5237700", "5237700", 
"5237700", "5247644", "5805912", "5828869", "5357626", "5247644", 
"5805912", "5828869", "5357626"), ID = c("03859643-1", "04488864-4", 
"04560399-1", "03859643-1", "03859643-1", "03859643-1", "03859643-1", 
"03901719-2", "04086089-2", "04086089-2", "04407934-2", "04488864-4", 
"04952026-3", "03859643-1", "03859643-1", "03901719-2", "03912481-3", 
"03940277-1", "04979975-2", "03859643-1", "03859643-1", "03864113-1", 
"03864113-1", "04877300-1", "04877300-1", "04877300-3", "04877300-3", 
"05040862-3", "05040862-3", "05093287-4", "05093287-4", "05093287-6", 
"05093287-6", "03859643-1", "03859643-1", "03859643-1", "03870399-2", 
"03901719-2", "03923529-1", "04784976-1", "03860454-2", "03860454-2", 
"03860454-2", "03860454-2", "03860454-2", "03860454-2", "04761567-2", 
"04870622-2", "04870622-2", "04870622-2", "04878166-2", "04878166-2", 
"04878166-2", "04878166-3", "04878166-3", "04878166-3", "04878166-5", 
"04878166-5", "04878166-5", "03860454-2", "03860454-2", "04610004-1", 
"04734852-2", "04734852-3", "04761567-2", "04761567-2", "04777587-1", 
"04835414-1", "04878166-2", "04926323-10", "04926323-5", "03860454-2", 
"03860454-2", "03860454-2", "03860454-2", "05237700-2", "05237700-3", 
"03860454-2", "03860454-2", "03860454-2", "03860454-2", "04731737-1", 
"04731737-1", "04731737-1", "04731737-1"), Time = c(1986L, 1986L, 
1986L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 
1988L, 1988L, 1989L, 1989L, 1989L, 1989L, 1989L, 1989L, 1990L, 
1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 
1990L, 1990L, 1990L, 1990L, 1991L, 1991L, 1991L, 1991L, 1991L, 
1991L, 1991L, 1986L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 
1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 
1987L, 1987L, 1987L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 
1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1989L, 1989L, 1990L, 
1990L, 1990L, 1990L, 1991L, 1991L, 1991L, 1991L, 1991L, 1991L, 
1991L, 1991L)), .Names = c("pnum", "inventor", "pryear"), row.names = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 
16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 
29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 325L, 
326L, 327L, 328L, 329L, 330L, 331L, 332L, 333L, 334L, 335L, 336L, 
337L, 338L, 339L, 340L, 341L, 342L, 343L, 344L, 345L, 346L, 347L, 
348L, 349L, 350L, 351L, 352L, 353L, 354L, 355L, 356L, 357L, 358L, 
359L, 360L, 361L, 362L, 363L, 364L, 365L, 366L, 367L, 368L, 369L
), class = "data.frame")

Multiple inventors collaborate on a project pnum in a specific year called pryear. I am looking for three things:

After comments from @Thierry I changed the data sample to ensure that the problem he pointed out was dealt with.

  1. The number of projects conducted by each individual inventors in an x (say 3) year window before the current pryear, thus if year of current project is 1977, I want the number of projects conducted from 1974 until 1976 included. If there are no occurrences before, ideally the result would be '0'. the answer provided by @Alex here can be used to achieve this first goal. But as discussed in the comments, it is not highly efficient (especially as my time range is from 1952 to 2010 with over 50,000 inventors).
  2. The total number of different inventors with whom each inventor has worked in that same time window
  3. If a project has multiple inventors, I am looking for the number of times each inventor has collaborated with the other inventors who are working on the current project during the same past time window
Community
  • 1
  • 1
SJDS
  • 1,239
  • 1
  • 16
  • 31
  • Please provide a better set of sample data. This set contains no data that matches your first objective. – Thierry Dec 07 '15 at 14:32
  • @Thierry, thanks for your reply. The example does contain all the info I think. What could be added as a fourth column is a dummy variable that is 1 for each row. Every invention (number in first column) is one project that is executed by the inventors (strings in second column) in a specific year (number in third column). I did notice a problem when copy-pasting the data so I changed the upload to include also full dates. Hope this clarifies ? – SJDS Dec 08 '15 at 01:01
  • No it doesn't contain all the info. None of the inventors has projects from multiple years. `rowSums(with(df, table(inventors, pryear)) > 0)` – Thierry Dec 09 '15 at 10:01
  • @Thierry, thanks for clarifying. I had missed this completely.This was idiosyncratic to that part of the data sample. Just uploading a new one. I based the sample now on a combo of two prolific `inventors`. – SJDS Dec 09 '15 at 10:32

1 Answers1

0

Here is a solution for you first question. You can solve the other ones as an exercise.

The first solutions uses only dplyr. You will probably run into problems with large datasets.

library(dplyr)
df %>% 
  inner_join(
    df %>% 
      select(inventor, oldyear = pryear), 
    by = "inventor") %>% 
  filter(pryear - 3 <= oldyear, oldyear < pryear) %>% 
  group_by(inventor, pryear) %>% 
  summarise(projects = n())

The second solutions use dplyr with a database back-end. That should be able to cope with larger datasets. Note that the code is very similar.

library(RSQLite)
library(dplyr)
conn <- dbConnect(SQLite(), "test")
dbWriteTable(conn, "project", df)
src <- src_sqlite("test")
tbl(src, "project") %>% 
  inner_join(
    tbl(src, "project") %>% 
      select(inventor, oldyear = pryear), 
    by = "inventor") %>% 
  filter(pryear - 3 <= oldyear, oldyear < pryear) %>% 
  group_by(inventor, pryear) %>% 
  summarise(projects = n()) %>% 
  collect()
Thierry
  • 18,049
  • 5
  • 48
  • 66
  • One minor tweak I think is necessary to really answer my first question. It should be `filter(pryear - 3 < oldyear, oldyear < pryear) %>% `, so the `<=` needs to become `<`. If not all the projects done in the pryear are included in the outcome. – SJDS Jan 11 '16 at 09:42