I have one data frame of length 5923 and the second data frame of length 68709. The first data frame looks like this and the second data frame looks like this
Their common column is the first column "people_id".
So far i've done this:
#
# This R function merges two tables and returns the resulting table in a new data frame.
# inputs
# 1. tbl1 loaded from a csv file.
# 2. tbl2 is output from an query containing people_id and repository_id
# There can be multiple repository ids associated to each people id
#
mergetbl <- function(tbl1, tbl2)
{
# tbl1 -- from csv file
# tbl2 -- from sql query
# 1. create an empty data frame
# 2. go through tbl1 row by row
# 3. for each row in tbl1, look at the current people_id in tbl2 and extract all associated repository_id
# 4. duplicate the same row in tbl1 the same number of times there are associated repository ids
# 5. merge duplicate rows with the column repository ids
# 6. merge duplicate rows into new data frame
# 7. repeat from 2. until last row in tbl1
newtbl = data.frame(people_id=numeric(),repoCommitted=numeric(),isAuthor=numeric(),repoAuthor=numeric(),commonRepo=numeric())
ntbl1rows<-nrow(tbl1)
tbl2patched<-tbl2[complete.cases(tbl2),]
for(n in 1:ntbl1rows)
{
ndup<-nrow(tbl2patched[tbl2patched$people_id==tbl1$people[n],])
duprow<- tbl1[rep(n,ndup),]
newtbl<-rbind(newtbl,duprow)
}
}
Im stuck at step 5 where it merges the column from tbl2patched of "repository_id" to the newtbl where the id matches. The first data frame looks like this:
people committers repositoryCommitter authors repositoryAuthor
1 1 921 183 896 178
2 2 240 18 209 22
3 3 3 2 28 11
4 4 6548 23 6272 29
5 5 3557 146 3453 146
and so on... until 5923 rows return.
The second data frame:
people_id repository_id
1 1
1 2
1 6
1 7
1 10
and so on till 68709 rows.
The output should look like this: This is what the sample looks like:
people_id committers repoCommitter authors repoAuthors commonRepo
1 1 921 183 896 178 1
2 1 921 183 896 178 2
3 1 921 183 896 178 6
4 1 921 183 896 178 7
5 1 921 183 896 178 10