I want to remove all characters that don't match a string pattern, using the stringr package. So far I've been able to remove the words before the pattern using "\\w+(?= (grape|satsuma))" as the pattern, but removing those after the pattern still seems impossible.

> str_remove_all("apples grape banana melon olive persimon grape apples satsuma papaya", 
+                "\\w+(?= (grape|satsuma))")
[1] " grape banana melon olive  grape  satsuma papaya"

The desired result is:

"grape grape satsuma"

(NOTE: I am aware that the easiest approach in this case is to extract only "grape" and "satsuma", but for analysis purposes I prefer to do it this way.)

Edit: the entire problem

The entire problem is as follows: given a data frame d that contains a column of strings, the function should return the same column containing only the matches:

> d
# A tibble: 2 x 2
  string_column                  c2
  <chr>                       <dbl>
1 apples grape banana satsuma     3
2 grape banana satsuma melon      4

Using the answer provided by @d.b works:

> d %>% 
+   mutate_at(vars(string_column), ~ gsub("(grape|satsuma| )(*SKIP)(*FAIL)|.", "", ., perl = TRUE))

# A tibble: 2 x 2
  string_column        c2
  <chr>             <dbl>
1 " grape  satsuma"     3
2 "grape  satsuma "     4

None of the answers provided so far that use the stringr package manage to return string_column.

This is the dput for d:

d <- structure(list(string_column = c("apples grape banana satsuma", 
"grape banana satsuma melon"), c2 = c(3, 4)), row.names = c(NA, 
-2L), class = c("tbl_df", "tbl", "data.frame"))
Tito Sanz
  • `gsub("(grape|satsuma| )(*SKIP)(*FAIL)|.", "", "apples grape banana melon olive persimon grape apples satsuma papaya", perl = TRUE)` – d.b May 15 '18 at 14:43
  • @d.b yeah! But I want to use `stringr` package, any idea? – Tito Sanz May 15 '18 at 14:46
  • Using `str_remove_all` with `"\\w+(?= (grape|satsuma))"` as the pattern removes the words that come before `grape` or `satsuma`. What I want is for `str_remove_all` to erase everything that doesn't match `grape` or `satsuma`, so the desired result in this case is: `"grape grape satsuma"`. Please let me know if the purpose is not clear enough. – Tito Sanz May 15 '18 at 15:17

1 Answer

You may want to look at negative lookaheads and some related regex techniques in the linked thread.
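
As a minimal sketch of the negative-lookahead idea using stringr alone (my own illustration, not taken from the linked thread): the pattern below removes every whole word that is not exactly "grape" or "satsuma", and `str_squish` then tidies the leftover spaces. It assumes the targets appear as whole words, so a word such as "grapes" would be removed too, which differs from the `gsub` approach in the comments.

```r
library(stringr)

x <- "apples grape banana melon olive persimon grape apples satsuma papaya"

# \\b                        start at a word boundary
# (?!(?:grape|satsuma)\\b)   ...but only if the word here is NOT grape/satsuma
# \\w+                       then consume (and remove) the whole word
pattern <- "\\b(?!(?:grape|satsuma)\\b)\\w+"

# str_remove_all() leaves the spaces behind; str_squish() trims the ends
# and collapses repeated spaces into one.
cleaned <- str_squish(str_remove_all(x, pattern))
cleaned
```

Both functions are vectorised, so the same expression can be applied directly to a data frame column.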

However, since we are extracting words I'd rather use str_extract_all and I'd do it like this:

str_extract_all("apples grape banana melon olive persimon grape apples satsuma papaya", 
                "grape|satsuma")
[[1]]
[1] "grape"   "grape"   "satsuma"

I also really like this line that @steveLangsford left in a comment:

paste0(unlist(str_extract_all("apples grape banana melon olive persimon grape apples satsuma papaya", "grape|satsuma")), collapse=" ") 
"grape grape satsuma"

Taking it a little bit further based on our discussion/comments:

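library(stringr)
library(tibble)

string_column <- c("apples grape banana satsuma", "grape banana satsuma melon") 
c2            <- c(3, 4) 
d             <- tibble(string_column, c2) 

# Extract the matches from one string and join them back into a single string
myfun <- function(x) {paste0(unlist(str_extract_all(x, "grape|satsuma")), collapse=" ") }

sapply(d$string_column, myfun)
apples grape banana satsuma  grape banana satsuma melon 
            "grape satsuma"             "grape satsuma" 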
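
To keep this inside a `mutate` call, the extraction has to return exactly one string per row. A minimal sketch of one way to do that, assuming dplyr is loaded; `keep_matches` is a hypothetical helper name introduced here for illustration:

```r
library(stringr)
library(dplyr)
library(tibble)

d <- tibble(
  string_column = c("apples grape banana satsuma", "grape banana satsuma melon"),
  c2 = c(3, 4)
)

# keep_matches() is a hypothetical helper, not part of stringr.
# str_extract_all() returns a list with one character vector per input string;
# vapply() collapses each of those vectors into a single string, so the result
# has the same length as the input column.
keep_matches <- function(x, pattern) {
  matches <- str_extract_all(x, pattern)
  vapply(matches, paste, character(1), collapse = " ")
}

result <- d %>%
  mutate(string_column = keep_matches(string_column, "grape|satsuma"))
result
```

Collapsing per element rather than calling `unlist` on the whole list is what avoids the "must be length 2 ... not 4" error reported in the comments.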
Hack-R
  • Using `str_extract_all` solves the problem with the regular expression. However, I need to apply this to a column of a data frame, so I need the expression to return a single string per row. After trying `str_extract_all` and `str_c` with `map_df` in various ways, I assumed the easiest way might be to use inverse matching in the regular expression, but that also seems impossible. I've tried several regular expressions, with `"\\w+(?= (grape|satsuma))"` being the closest to my purpose. – Tito Sanz May 15 '18 at 14:44
  • @TitoSanz I see. Let me work on it a bit more over lunch and see if I can add some things to that end. – Hack-R May 15 '18 at 14:46
  • `paste0(unlist(str_extract_all("apples grape banana melon olive persimon grape apples satsuma papaya", "grape|satsuma")), collapse=" ")` – steveLangsford May 15 '18 at 14:47
  • Let's assume @steveLangsford's approach. When I tried to apply it along a column in a data frame, it returned an error: – Tito Sanz May 15 '18 at 15:18
  • @TitoSanz please paste the error; maybe it can be applied differently. The code where you applied it would also be good. In the meantime I'm looking at an additional approach. – Hack-R May 15 '18 at 15:21
  • I tried to apply it along a column in a data frame and it returned an error: 'Column `string_column` must be length 2 (the number of rows) or one, not 4'. The code used is: `string_column <- c("apples grape banana satsuma", "grape banana satsuma melon"); c2 <- c(3, 4); d <- tibble(string_column, c2); d %>% mutate_at(vars(string_column), function(x) paste0(unlist(str_extract_all(x, pattern = "grape|satsuma"))))` – Tito Sanz May 15 '18 at 15:27
  • @TitoSanz `sapply` works; I just made a related edit – Hack-R May 15 '18 at 15:35
  • @Hack-R note that this approach doesn't return two rows of a data frame; instead the results are transposed and split. The first column in both rows must be `"grape satsuma"`. This is the reason why using `mutate_at` returns that error. – Tito Sanz May 15 '18 at 15:46
  • Also note that the answer provided by @d.b works using: `d %>% mutate_at(vars(string_column), ~gsub("(grape|satsuma| )(*SKIP)(*FAIL)|.", "", ., perl = TRUE))` – Tito Sanz May 15 '18 at 15:56
  • @TitoSanz with regard to the first comment, if you want to transpose it's just `t()`. But it sounds like you have a solution with the edit to your code from d.b, so that should be good. – Hack-R May 15 '18 at 16:19