I'm trying to up my R game, and I clearly need some guidance. I want to create a lot of variables (93, to be exact), and I want to do it the smart way, but I'm stuck.

My problem: I have a dataframe (df) containing some variables, including the "main" one, Description, which holds the word stems of my description text. I also have another dataframe (reference), more of a reference table, with two columns: the category and the regex needed to identify it. I kept only 3 entries here, but it originally has 93.

The code:

library(tidyverse)

df <- tibble("FlawType" = c(rep("Medium", 5), rep("Major", 5)),
         "Description" = c("utilizaca indev equip final divers daquel justific aquisica",
                           "utilizaca modal indev licitac aquisica mater previst plan trabalh conveni nomd",
                           "aquisica indev lanch gener alimentici secret municip educaca mont r",
                           "uso indev recurs bloc atenca basic aquisica medic realizaca trat intim prefeit decisa judic",
                           "indici irregular favorec process licitato no aquisica medic farmac basic raza concentraca indevid empr certam",
                           "localizaca bem vist realiz equip fiscalizaca cgu escol municip abril municipi palestin par",
                           "telecentr inat ausenc equip local instalaca equip defeit",
                           "equip local",
                           "equip mater permanent adquir implantaca banc aliment send utiliz outr local simples encontr in loc realiz equip",
                           "mater equip gener alimentici adquir recurs cra por entreg local atend"))

reference <- tibble(var = c("Aquisição indevida", "Equipamentos não localizados", "Despesa irregular"),
                    regex = c("(aquisica.*indev|indev.*aquisica)", "(equip.*local|local.*equip)", "(desp.*irregul|irregul.*desp)"))

I can more or less create the three new variables in my sample df, but the result turns out to be a list, and I have to extract the columns from it. I thought that wouldn't be a problem, but when I try to run it on my original df (60k+ rows), it gets stuck...

The idea is: use each value of reference$var as the name of a new variable, and use the associated regex (reference$regex) to create a dummy for every entry in reference.

Code that works in the sample but not in the original df, just for reference:

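varnames <- unique(reference$var)

# Container for the per-variable results
fd <- list()

for (varname in varnames) {
  # Use only the regex that belongs to this variable
  pattern <- reference$regex[reference$var == varname]

  fd[[varname]] <- df %>%
    mutate(!!varname := ifelse(str_detect(Description, pattern), 1, 0))
}

# Pull the third column (the new dummy) out of each list element
df <- bind_cols(df, map_df(fd, 3))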

Thanks in advance.

GVianaF

1 Answer

There's probably a more elegant way to do this (I'm not a huge fan of having to use bind_cols at the end to bring back the original variables), but this should work:

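# Return a one-column tibble: a 0/1 dummy named `x` that flags the rows
# whose Description matches the regex `y`
add_vars <- function(df, x, y) {
  x <- quo_name(x)
  transmute(df, !! x := ifelse(str_detect(Description, y), 1, 0))
}

# Call add_vars() once per (name, regex) pair and bind the new columns to df
bind_cols(df, map2_dfc(reference$var, reference$regex, ~ add_vars(df, .x, .y)))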

# A tibble: 10 x 5
   FlawType Description                                                 `Aquisição indevi~ `Equipamentos não loc~ `Despesa irregul~
   <chr>    <chr>                                                                    <dbl>                  <dbl>             <dbl>
 1 Medium   utilizaca indev equip final divers daquel justific aquisica                  1                      0                 0
 2 Medium   utilizaca modal indev licitac aquisica mater previst plan ~                  1                      0                 0
 3 Medium   aquisica indev lanch gener alimentici secret municip educa~                  1                      0                 0
 4 Medium   uso indev recurs bloc atenca basic aquisica medic realizac~                  1                      0                 0
 5 Medium   indici irregular favorec process licitato no aquisica medi~                  1                      0                 0
 6 Major    localizaca bem vist realiz equip fiscalizaca cgu escol mun~                  0                      1                 0
 7 Major    telecentr inat ausenc equip local instalaca equip defeit                     0                      1                 0
 8 Major    equip local                                                                  0                      1                 0
 9 Major    equip mater permanent adquir implantaca banc aliment send ~                  0                      1                 0
10 Major    mater equip gener alimentici adquir recurs cra por entreg ~                  0                      1                 0
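
If the bind_cols step bothers you, one possible alternative is to assign each dummy straight into df inside a loop over the rows of reference. The following is only a sketch under some assumptions: it swaps stringr's str_detect for base R's grepl so it runs without any packages, and it uses cut-down copies of df and reference (three descriptions and two patterns taken from the question) purely for illustration.

```r
# Cut-down, illustrative copies of `df` and `reference` from the question
df <- data.frame(
  Description = c("aquisica indev lanch gener alimentici secret municip educaca mont r",
                  "equip local",
                  "telecentr inat ausenc equip local instalaca equip defeit"),
  stringsAsFactors = FALSE
)

reference <- data.frame(
  var   = c("Aquisição indevida", "Equipamentos não localizados"),
  regex = c("(aquisica.*indev|indev.*aquisica)", "(equip.*local|local.*equip)"),
  stringsAsFactors = FALSE
)

# One pass per reference row: each pattern is matched against the whole
# Description column, and the 0/1 result is stored under the category name
for (i in seq_len(nrow(reference))) {
  df[[reference$var[i]]] <- as.integer(grepl(reference$regex[i], df$Description))
}

df
```

Each iteration pairs one name with one regex by position, so the original columns stay in place and nothing has to be extracted from a list afterwards. I have not timed this on 60k+ rows, so treat it as a starting point rather than a guaranteed fix for the speed problem.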
Phil