Make dataframe with mapping between two other dataframes

Question

I'm writing a function to create a mapping between commits and Jira issues. In short, it takes two data frames: one containing commit hashes and commit messages, the other containing Jira issues. The third argument is a regex that tells the function how to extract issue keys from the commit messages.
The result should be a data frame with two columns, commit and issue, containing the commit hash and issue key of every mapping found. Duplicates (e.g. the same issue key twice) should be listed as well.

Based on this question, I managed to do it the wrong way, using nested loops and a logical matrix that says whether each mapping exists:

connect_commits_to_issues <- function(commit_data, issue_data, regex) {
    extracted <- commit_data %$% msg %>% str_extract_all(regex) %>% as.vector()
    map <- sapply(extracted, function(commit) {
        apply(issue_data, 1, function(r) any(r == commit))
    }) %>% t()

    result <- data.frame(commit = character(0), issue = character(0), stringsAsFactors = FALSE)
    # seq_len() yields an empty sequence for zero-row input, unlike 1:nrow()
    for (i in seq_len(nrow(commit_data))) {
        for (j in seq_len(nrow(issue_data))) {
            if (map[i, j]) {
                result[nrow(result) + 1,] <- list(commit = commit_data$commit[i],
                                                  issue = issue_data$key[j])
            }
        }
    }

    result
}

Example of usage:

library('tidyverse')
library('magrittr')  # provides %$%, which library('tidyverse') does not attach
valid_jira_df <- data.frame(key = c("ISSUE-13", "ISSUE-169"),
                            stringsAsFactors = FALSE)
valid_commit_df <- data.frame(commit = c("A", "B", "C"),
                              msg = c("ISSUE-13 Fix", "new feature", "Another ISSUE-13 fix"),
                              stringsAsFactors = FALSE)
result <- connect_commits_to_issues(valid_commit_df, valid_jira_df, "(ISSUE-\\d+)")

str(result)
#'data.frame':  2 obs. of  2 variables:
#$ commit: chr  "A" "C"
#$ issue : chr  "ISSUE-13" "ISSUE-13"

I know this solution is very un-R-ish. Can it be done in a smarter (and vectorized) way?

How do you run `connect_commits_to_issues`? If I do `connect_commits_to_issues(valid_commit_df, valid_jira_df, "(ISSUE-\\d+)")`, it returns the error `Error in x[[jj]][iseq] <- vjj : replacement has length zero`. I think you can remove the `test_that` check for the purpose of this question, as it is not needed. — Ronak Shah, Apr 11 '20 at 08:09
@RonakShah I left `test_that` in to show the expected results as well, but you are right, it's not needed. The error was a typo, fixed now (my real function does some more data cleaning; I tried to reduce it to a minimal example here). — Yksisarvinen, Apr 11 '20 at 21:13
Why do you need `valid_jira_df`? You could do `valid_commit_df %>% filter(str_detect(msg, 'ISSUE-\\d+'))`. — Ronak Shah, Apr 12 '20 at 14:15
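
For reference, one possible loop-free take on what the question asks for, as a minimal sketch rather than a definitive answer. It uses base R only (`gregexpr` and `regmatches` with `perl = TRUE`); the function name `connect_commits_to_issues_vec` and the variables `commits`, `issues`, and `mapping` are made up for illustration. It assumes the columns are named `commit`, `msg`, and `key` as in the example above, and that a match should count only if the extracted key appears in the `key` column. Unlike the original, it lists a key once per occurrence in a message and ignores any other columns of the issue data.

```r
# Hypothetical vectorized variant of connect_commits_to_issues (base R only).
connect_commits_to_issues_vec <- function(commit_data, issue_data, regex) {
    # One character vector of matches per commit message (empty if none).
    extracted <- regmatches(commit_data$msg,
                            gregexpr(regex, commit_data$msg, perl = TRUE))

    # Long format: repeat each commit hash once per match found in its message.
    candidates <- data.frame(
        commit = rep(commit_data$commit, lengths(extracted)),
        issue = as.character(unlist(extracted)),
        stringsAsFactors = FALSE
    )

    # Keep only matches that correspond to a known issue key.
    result <- candidates[candidates$issue %in% issue_data$key, , drop = FALSE]
    rownames(result) <- NULL
    result
}

# Same example data as in the question, under illustrative names.
issues <- data.frame(key = c("ISSUE-13", "ISSUE-169"),
                     stringsAsFactors = FALSE)
commits <- data.frame(commit = c("A", "B", "C"),
                      msg = c("ISSUE-13 Fix", "new feature", "Another ISSUE-13 fix"),
                      stringsAsFactors = FALSE)

mapping <- connect_commits_to_issues_vec(commits, issues, "(ISSUE-\\d+)")
str(mapping)
```

With the example data this should pair commits A and C with ISSUE-13, matching the expected result shown in the question. If the tidyverse is already loaded, `str_extract_all(commit_data$msg, regex)` could presumably stand in for the `regmatches` call, though its regex engine differs from PCRE in some details.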
