I have two data set
merge_paper_author:
author_id paper_id
731 136
731 198
778 241
........
paper_author:
paper_id author_id
136 731
136 726
136 728
138 731
138 726
138 777
138 798
140 728
140 726
140 731
In the merge_paper_author for paper_id 136 author_id is 731.Now i need to take this value of paper_id=136 and author_id=731 and look into the paper_author table.For paper_id=136 i have 3 authors,i.e author_id=731,726 and 728.For this I will have 2 combinations of author_ids(considering 731 fixed)- 731, 726 and 731,728.
Now I need to look this combination across all papers in paper_author and find such combination and count them.For eg- for paper_id=138 i have 3 author_ids=731,777,726.In this there is 1 count of 731 and 726.In paper_id=140 there is combination of author_id=731,728 and 726 so count is 2 in this case(731,726 and 731 and 728).So total count is 2+1=3.i also need the no of times such occurences occur.
This i need to do for every such author_id , paper_id combination in merge_paper_author.I have written a code but its highly time consuming.I can do with R or sql query.My r code is:
count_multiple_sin_pap<-vector()
co_authored_comb_sin_pap<-vector()
no_coauthors_comb<-vector()
data_papers<-merge_paper_author$paper_id
data_authors_sin<-merge_paper_author$author_id
data_i<-vector()
for(i in 1:length(data_papers))
{
index<-which(paper_author$paper_id%in%data_papers[i])
if(length(index)>1){
authors<-paper_author$author_id[index]
if(length(unique(authors))==1){data_i<-rbind(data_i,i)}
else{
index1<-which(paper_author$author_id%in%data_authors_sin[i])
author<-paper_author$author_id[index1]
index2<-which(authors==data_authors_sin[i])
authors<-authors[-index2]
index3<-which(paper_author$author_id%in%data_authors_sin[i])
index4<-which(paper_author$paper_id%in%paper_author$paper_id[index3])
index5<-which(paper_author$author_id[index4]%in%authors)
dat<-paper_author$author_id[index4][index5]
dat<-data.frame(dat)
agg<-aggregate(list(numdup=rep(1,nrow(dat))), dat, length)
count_multiple_sin_pap[i]<-length(which(agg$numdup>1))
co_authored_comb_sin_pap[i]<-sum(agg$numdup)
no_coauthors_comb[i]<-nrow(agg)
}}else{count_multiple_sin_pap[i]<-0
co_authored_comb_sin_pap[i]<-0
no_coauthors_comb[i]<-0
}
}
In the above code I have taken two inputs data_papers and data_authors_sin
dput(data_papers):
1696062L, 1696062L, 1701471L, 1701471L, 1707481L, 1707481L, 1722811L,
1722811L, 1754307L, 1754307L, 1768287L, 1768287L, 1791496L, 1791496L,
1795108L, 1795108L, 1816428L, 1816428L, 1844081L, 1844081L, 1850488L,
1856769L, 1856769L, 1931620L, 1931620L, 1947922L, 1955753L, 1955753L,
dput(data_authors_sin):
2292903L, 2292903L, 2292903L, 2292903L, 2292903L, 2292903L, 2292903L,
2292903L, 2292903L, 2292903L, 2292903L, 2292903L, 2292903L, 2292903L,
2292903L, 2292903L, 2292903L, 2292903L, 2292903L, 2292903L, 2292903L,
2292903L, 2292903L, 2292903L, 2292903L, 2292903L, 2292903L, 2292903L,
Thanks