
I have one data frame with 5923 rows and a second data frame with 68709 rows; samples of both are shown below.

Their common column is the first column "people_id".

So far I've done this:

#
# This R function merges two tables and returns the resulting table in a new data frame.
# Inputs:
# 1. tbl1 is loaded from a CSV file.
# 2. tbl2 is the output of a query containing people_id and repository_id.
#    There can be multiple repository_ids associated with each people_id.
#
mergetbl <- function(tbl1, tbl2)
{
  # tbl1 -- from csv file
  # tbl2 -- from sql query
  # 1. create an empty data frame
  # 2. go through tbl1 row by row
  # 3. for each row in tbl1, look at the current people_id in tbl2 and extract all associated repository_id
  # 4. duplicate the same row in tbl1 the same number of times there are associated repository ids
  # 5. merge duplicate rows with the column repository ids
  # 6. merge duplicate rows into new data frame
  # 7. repeat from 2. until last row in tbl1
  newtbl <- data.frame(people_id=numeric(), repoCommitted=numeric(), isAuthor=numeric(),
                       repoAuthor=numeric(), commonRepo=numeric())

  ntbl1rows <- nrow(tbl1)
  tbl2patched <- tbl2[complete.cases(tbl2), ]
  for (n in 1:ntbl1rows)
  {
    ndup <- nrow(tbl2patched[tbl2patched$people_id == tbl1$people[n], ])
    duprow <- tbl1[rep(n, ndup), ]
    newtbl <- rbind(newtbl, duprow)
  }
}

I'm stuck at step 5, where the "repository_id" column from tbl2patched should be merged into newtbl wherever the id matches. The first data frame looks like this:

      people committers repositoryCommitter authors repositoryAuthor
    1      1        921                 183     896              178
    2      2        240                  18     209               22
    3      3          3                   2      28               11
    4      4       6548                  23    6272               29
    5      5       3557                 146    3453              146

and so on, up to 5923 rows.

The second data frame:

    people_id repository_id
    1           1
    1           2
    1           6
    1           7
    1           10

and so on, up to 68709 rows.

The expected output looks like this:

    people_id committers   repoCommitter authors   repoAuthors  commonRepo
1    1        921          183            896       178           1
2    1        921          183            896       178           2
3    1        921          183            896       178           6
4    1        921          183            896       178           7
5    1        921          183            896       178           10

1 Answer


I loaded your data, recognizing that the first column in each CSV file clearly contains row names:

people <- read.csv('people.csv',row.names=1);
peoplePerRepo <- read.csv('peoplePerRepo.csv',row.names=1);

Sample of the resulting data.frames:

head(people);
##   people committers repositoryCommitter authors repositoryAuthor
## 1      1        921                 183     896              178
## 2      2        240                  18     209               22
## 3      3          3                   2      28               11
## 4      4       6548                  23    6272               29
## 5      5       3557                 146    3453              146
## 6      6        445                  55     444               55
head(peoplePerRepo);
##   people_id repository_id
## 1         1             1
## 2         1             2
## 3         1             6
## 4         1             7
## 5         1            10
## 6         1            11

One detail we should notice here is that there is an inconsistency in the key column name: people$people vs. peoplePerRepo$people_id. Nothing we can't handle, as we'll see.
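As a quick aside, merge() handles differently named key columns via its by.x/by.y arguments, and it also performs the one-to-many row duplication automatically. Here is a minimal sketch on made-up toy data (the names left, right, score, and repo, and all values, are invented for illustration):

```r
## toy data: the key column is named differently in each data.frame,
## and the right-hand table has multiple rows per key
left <- data.frame(people=c(1,2),score=c(10,20));
right <- data.frame(people_id=c(1,1,2),repo=c(5,6,7));
merge(left,right,by.x='people',by.y='people_id');
##   people score repo
## 1      1    10    5
## 2      1    10    6
## 3      2    20    7
```

Note that the key column in the result takes its name from by.x, and each left-hand row is repeated once per matching right-hand row.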

I investigated the data for my own benefit, but I'll include some results here to ensure we're on the same page. First, the row counts:

nrow(people);
## [1] 5923
nrow(peoplePerRepo);
## [1] 72179

So your claim that the first data.frame is of length 5923 is confirmed, but the second is longer than your claim of 68709: it's 72179. I checked if maybe there was some duplication of rows, but there does not appear to be:

nrow(unique(people));
## [1] 5923
nrow(unique(peoplePerRepo));
## [1] 72179

So we have two data.frames with 5923 and 72179 unique rows. Looking into the keys:

range(people$people);
## [1]    1 5923
setdiff(1:5923,people$people);
## integer(0)
range(peoplePerRepo$people_id);
## [1]    1 5923
setdiff(1:5923,peoplePerRepo$people_id);
## integer(0)

The above proves that the key column in each data.frame contains only values in 1:5923, and every value in that range is represented at least once in both tables. And because people$people is exactly 5923 elements long, we know each value in 1:5923 must be represented exactly once.
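As a sanity-check pattern, here is the same range()/setdiff() idiom on a toy key vector (the values are invented for illustration):

```r
## toy key vector covering 1:3, with one duplicate
keys <- c(2,1,3,1);
range(keys);
## [1] 1 3
setdiff(1:3,keys);
## integer(0)
```

An empty setdiff() result means every value in the candidate range is present in the keys; range() alone only bounds the values, so the two checks complement each other.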

range(table(people$people));
## [1] 1 1
range(table(peoplePerRepo$people_id));
## [1]   1 466

The first of the above two statements reconfirms what I just said, namely, that in people$people every value in 1:5923 is represented exactly once. The second statement shows that in peoplePerRepo$people_id the values 1:5923 have frequencies ranging from 1 to 466. So this is definitely a one-to-many relationship. You can examine the exact frequencies by omitting the range() call, in other words by running table(peoplePerRepo$people_id), but the output is verbose and I won't include it here.
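For example, table() and range() behave like this on a toy key vector (values invented for illustration):

```r
## toy one-to-many key vector: key 1 occurs 3 times, key 2 once, key 3 twice
ids <- c(1,1,1,2,3,3);
table(ids);
## ids
## 1 2 3
## 3 1 2
range(table(ids));
## [1] 1 3
```

table() returns the frequency of each distinct value, and wrapping it in range() collapses those frequencies to their minimum and maximum.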

Finally, it's always good to check for NAs in key columns. We can already deduce that there can't be any NAs in people$people, because it contains exactly the set 1:5923, but we should at least check peoplePerRepo$people_id:

sum(is.na(people$people));
## [1] 0
sum(is.na(peoplePerRepo$people_id));
## [1] 0

So, no NAs in the key columns.

Also, take a look at the summary() function, which is convenient (in general) for getting quick summary statistics of a vector or of all columns of a data.frame.

summary(people);
##      people       committers      repositoryCommitter    authors        repositoryAuthor
##  Min.   :   1   Min.   :    0.0   Min.   :  0.0       Min.   :    0.0   Min.   :  0.00
##  1st Qu.:1482   1st Qu.:    0.0   1st Qu.:  0.0       1st Qu.:    2.0   1st Qu.:  1.00
##  Median :2962   Median :    0.0   Median :  0.0       Median :    6.0   Median :  2.00
##  Mean   :2962   Mean   :  200.0   Mean   : 11.6       Mean   :  198.2   Mean   : 14.06
##  3rd Qu.:4442   3rd Qu.:   39.5   3rd Qu.:  3.0       3rd Qu.:   63.0   3rd Qu.:  8.00
##  Max.   :5923   Max.   :15959.0   Max.   :466.0       Max.   :15938.0   Max.   :465.00
summary(peoplePerRepo);
##    people_id      repository_id
##  Min.   :   1.0   Min.   :   1.0
##  1st Qu.: 151.0   1st Qu.: 114.0
##  Median : 459.0   Median : 224.0
##  Mean   : 938.2   Mean   : 513.8
##  3rd Qu.:1147.0   3rd Qu.:1045.0
##  Max.   :5923.0   Max.   :1418.0
##                   NA's   :3470

So, based on all of the above, we can accomplish your requirement with a single merge() call:

output <- merge(people,peoplePerRepo,by.x='people',by.y='people_id');
nrow(output);
## [1] 72179
head(output);
##   people committers repositoryCommitter authors repositoryAuthor repository_id
## 1      1        921                 183     896              178             1
## 2      1        921                 183     896              178             2
## 3      1        921                 183     896              178             6
## 4      1        921                 183     896              178             7
## 5      1        921                 183     896              178            10
## 6      1        921                 183     896              178            11

The output row count 72179 makes sense. Since neither key column contains any NAs, and the key in every row of peoplePerRepo matches exactly one key in people, all 72179 successfully join with exactly one row of people. And you can see that the actual data in this sample matches your expected output.

One final (minor) point: the output column names do not exactly match your expected output column names. This could be fixed by assigning the entire vector of column names from scratch, or by selectively replacing whichever column names you want to change. Here I'll demonstrate the latter approach:

names(output)[names(output)=='people'] <- 'people_id';
names(output)[names(output)=='repositoryCommitter'] <- 'repoCommitter';
names(output)[names(output)=='repositoryAuthor'] <- 'repoAuthors';
names(output)[names(output)=='repository_id'] <- 'commonRepo';
head(output);
##   people_id committers repoCommitter authors repoAuthors commonRepo
## 1         1        921           183     896         178          1
## 2         1        921           183     896         178          2
## 3         1        921           183     896         178          6
## 4         1        921           183     896         178          7
## 5         1        921           183     896         178         10
## 6         1        921           183     896         178         11
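For reference, the former approach (assigning the entire names vector at once) is a one-liner. The sketch below uses a one-row stand-in for the merged result, since the assignment relies on the column order being exactly as shown above:

```r
## one-row stand-in for the merge() result, with the original column names and order
output <- data.frame(people=1,committers=921,repositoryCommitter=183,
                     authors=896,repositoryAuthor=178,repository_id=1);
## replace all six names in one assignment; order must match the existing columns
names(output) <- c('people_id','committers','repoCommitter','authors','repoAuthors','commonRepo');
```

The selective approach shown above is safer if the column order might change; the whole-vector approach is terser when the order is known and stable.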
– bgoldst