1

data has a column named description of type character() and a column id of type integer() that is set by row_number().

data_map has a column name desc_map of type character() and a column id of type integer() that is set by row_number().

data and data_map do have other columns used in further processing after joining.

The idea of the code below is to use data_map$desc_map as a pattern in str_detect to match data$description. On a match it would add a row to another tibble, using data$id and data_map$id. The resulting matches allows joining together of data and data_map.

library(tidyverse)

data = tribble(
  ~description,
  "19ABB123456",
  "19BCC123456",
  "19CDD123456",
  "19DEE123456",
  "19EFF456789",
  "19FF0056789",
  "19A0A123456",
) %>% mutate(id = row_number())

data_map = tribble(
  ~desc_map,
  "AA",
  "BB",
  "CC",
  "DD",
  "EE",
  "FF",
  "00",
) %>% mutate(id = row_number())

seq_along_rows <- function(.data) {
  seq_len(nrow(.data))
}

matches <- data %>% (function (tbl) {
  m <- tibble(
    row_id = integer(),
    map_id = integer()
  )

  for (i in seq_along_rows(tbl)) {
    row <- tbl[i, ]
    key <- row[["description"]]
    found <- FALSE

    for (j in seq_along_rows(data_map)) {
      map_row <- data_map[j, ]
      pattern <- map_row[["desc_map"]]

      if (str_detect(key, pattern)) {
        m <- add_row(m, row_id = row[["id"]], map_id = map_row[["id"]])
        found <- TRUE
        # allow for finding more than one match
      }
    }

    if (!found) {
      m <- add_row(m, row_id = row[["id"]], map_id = NA)
    }
  }

  return(m)
})

not_unique <- matches %>% 
  group_by(row_id) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  inner_join(data, by = c("row_id" = "id")) %>%
  inner_join(data_map, by = c("map_id" = "id"))

head(not_unique)
#> # A tibble: 2 x 4
#>   row_id map_id description desc_map
#>    <int>  <int> <chr>       <chr>   
#> 1      6      6 19FF0056789 FF      
#> 2      6      7 19FF0056789 00

matches_not_found <- matches %>%
  filter(is.na(map_id)) %>%
  select(-map_id) %>%
  inner_join(data, by = c("row_id" = "id"))

head(matches_not_found)
#> # A tibble: 1 x 2
#>   row_id description
#>    <int> <chr>      
#> 1      7 19A0A123456

matches_found <- matches %>%
  filter(!is.na(map_id)) %>%
  inner_join(data, by = c("row_id" = "id")) %>%
  inner_join(data_map, by = c("map_id" = "id"))

head(matches_found)
#> # A tibble: 6 x 4
#>   row_id map_id description desc_map
#>    <int>  <int> <chr>       <chr>   
#> 1      1      2 19ABB123456 BB      
#> 2      2      3 19BCC123456 CC      
#> 3      3      4 19CDD123456 DD      
#> 4      4      5 19DEE123456 EE      
#> 5      5      6 19EFF456789 FF      
#> 6      6      6 19FF0056789 FF

My question is, can this code be written in a more tidy functional way and what that would look like? If it cannot be done in such a way, what would the reason be?

NebulaFox
  • 7,813
  • 9
  • 47
  • 65
  • 2
    Can you provide some simple example data to get a feeling for in and output? – TimTeaFan Oct 05 '19 at 15:57
  • 2
    You can use the [`reprex`](https://reprex.tidyverse.org/articles/articles/magic-reprex.html) and [`datapasta`](https://cran.r-project.org/web/packages/datapasta/vignettes/how-to-datapasta.html) packages to quickly create a reproducible example so others can help. Please do not use `str()`, `head()` or screenshot. See also [Help me Help you](https://speakerdeck.com/jennybc/reprex-help-me-help-you?slide=5) & [How to make a great R reproducible example?](https://stackoverflow.com/q/5963269) – Tung Oct 05 '19 at 16:30
  • It is not clear from your code where `transactions[j, ]` is coming from. – TimTeaFan Oct 05 '19 at 19:48
  • @TimTeaFan I have fixed the example to clear up that mishap – NebulaFox Oct 06 '19 at 00:11
  • 1
    If you have sufficient RAM, it may be reasonable to simply get the unique vector of strings (`data_map$desc_map`), mutate that as a new column in `data` (put it in `list()`), unnest, add a Boolean column running `str_detect` on `data$description` for each, `spread`/`pivot_wider`, and `filter` as necessary. If you actually care about the ID in `data_map`, you can `left_join` that on at any point, assuming the ID-desc_map mapping is unique, or just use the tibble instead of a vector with the initial mutate. – GenesRus Oct 06 '19 at 08:29
  • @GenesRus: Would be interesting to see your approach in action. – TimTeaFan Oct 06 '19 at 10:22
  • 1
    @TimTeaFan I got some time to update the code example with Tung's feedback. – NebulaFox Oct 06 '19 at 15:44

1 Answers1

3

Update

Based on your updated question here is an updated version of my answer.

This time I just used your inputs as is and did not create a named function. Instead I put everything in one pipe. The column found should indicate how many times a pattern was found, so you should not need different objects as not_unique, matched_not_found, matches_found.

I picked up the idea from GenesRus (in the comments of your question) to create a list-column and unnnest it, but I did not take the approach further using spread/pivot-wider and instead chose map2 to loop over the description and desc_map columns.

library(tidyverse)

data %>% 
  mutate(pattern = list(data_map)) %>% 
  unnest %>% 
  rename(row_id = "id", map_id = "id1") %>% 
  mutate(v = map2_lgl(description, desc_map,
                  ~ str_detect(.x, .y))) %>% 
  group_by(row_id) %>% 
  mutate(found = sum(v),
         desc_map = ifelse(found == F, NA, desc_map),
         map_id = ifelse(found == F, NA, map_id)) %>% 
  filter(v == T | (v == F & found == 0)) %>%
  distinct %>%
  select(-v) 

Old answer

Below is a more tidyverse-based approach which should yield the same result. 'Should' because I can only guess how your input data and expected result looks like. A few notes: (1) I choose normal character vectors as inputs. Row ids are generated on-the-fly. (2) I put your approach into a function called match_tbl. (3) I used tidyverse functions in combination with the pipe-operator. This makes the whole approach easy to read and the appearance seems to be 'tidyverse-ish'. However, when you look into actual functions of tidyverse packages you will see that authors usually refrain from using the pipe operator inside functions, since it can easily throw errors. Use the RStudio debugger on a pipe operation and try to dig deeper into whats going on and you will see it is pretty messy. Therefore, if you want to make a real stable function out of it, drop the pipes and use intermediate variables instead.

Data and packages

library(tidyverse)

# some description data (not a dataframe but a normal char vector)
description <- c("This is a text description",
                "Some words that won't match",
                "Some random text goes here",
                "and some more explanation here")

# patterns that we want to find (not a dataframe but a normal char vector)
pattern <- c("explanation","description", "text")

A function generating the desired output: a match table

# a function which replaces your nested for loop
match_tbl <- function(.string, .pattern) {

  res <- imap(.pattern,
               ~ stringr::str_detect(.string, .x) %>% 
                     tibble::enframe(name = "row_id") %>%
                     dplyr::mutate(map_id = .y) %>% 
                     dplyr::filter(value == T) %>% 
                     dplyr::select(-"value"))

  string_tbl <- .string %>% 
             tibble::enframe(name = "id") %>% 
             dplyr::select("id")

  dplyr::bind_rows(res) %>%
    dplyr::right_join(string_tbl, by = c("row_id" = "id"))

}

Function call and output

match_tbl(description, pattern)
>   row_id map_id
>    <int>  <int>
> 1      1      2
> 2      1      3
> 3      2     NA
> 4      3      3
> 5      4      1
TimTeaFan
  • 17,549
  • 4
  • 18
  • 39
  • 1
    I am just amazed, thank you. Btw, I think there is a bug in your code. When it says `map_id = ifelse(found == F, NA, desc_map))` in the second `mutate` I believe it should be `map_id` not `desc_map`. – NebulaFox Oct 07 '19 at 12:10
  • Yeah, mapping is the better approach! No reason to add extra lines of code where it's not necessary. :) – GenesRus Oct 09 '19 at 23:50