I have a dataset where each row represents a citation to an article, along with the difference between the publication date of the article and each reference, like so:
EID ref delta
1 2-s2 r1 0
2 2-s2 r2 3
3 2-s2 r3 22
4 2-s2 r4 100
5 2-s2 r5 7
6 3-s2 r6 1
7 3-s2 r7 0
8 3-s2 r8 1
I want to determine for each distinct EID, how many references fall in different ranges of year deltas (I.e. for a given article, how many references are 1 year old, 2 years old, 4 years old, etc?). I attempted to create buckets for each:
buckets=c(0,1,2,4,8,16,32,64,9999)
bt=bt %>%
mutate(delta = as.numeric(delta)) %>%
mutate(bucket=cut(delta, breaks = buckets))
group = bt %>%
group_by(EID, bucket) %>%
summarise(count=n())
The resulting grouped data is:
EID bucket count
1 2-s2 (1,2] 6
2 2-s2 (2,4] 8
3 2-s2 (4,8] 16
4 2-s2 (8,16] 18
5 2-s2 (16,32] 10
6 3-s2 (1,2] 1
7 3-s2 (2,4] 13
8 3-s2 (4,8] 1
9 4-s1 (4,8] 3
I would like to create a column for each bucket I have, and then group by EID, placing the appropriate count in the appropriate bucket for each EID, where the result looks something like this:
EID (1,2] (2,4] (4,8] (8,16] (16,32]
1 2-s2 6 8 16 18 10
2 3-s2 1 13 1 0 0
2 4-s1 0 0 3 0 0
Looking at the code I used to generate the first table, it seems like I should be able to use unstack(group, bucket~count)
somehow, or just directly automate the creation of these bucket columns using summarise()
but I'm not clear on exactly how to do so. Ideally, I would not have to hard-code in each column; I would like to be able to reference the bucketing list, so if I decide to change the bucketing scheme, it will update accordingly. Thank you!