7

I have the following data:

dat <- data.frame(x               = c("this is my example text", "and here is my other text example", "my other text is short"),
                  some_other_cols = c(1, 2, 2))

Further, I have the following vector of patterns:

my_patterns <- c("my example", "is my", "my other text")

What I want to achieve is to remove any text of my_patterns that occurs in dat$x.

I tried the solution below, but the problem is that as soon as I remove the first pattern from the text (here: "my example"), my solution is not able to detect the occurence of the second (here: "is my") or third pattern anymore.

Wrong solution:

library(tidyverse)
my_patterns_c <- str_c(my_patterns, collapse = "|")

dat_new <- dat %>%
  mutate(short_x = str_replace_all(x, pattern = my_patterns_c, replacement = ""))

I guess I could do sth. like looping through all patterns, collect the string positions in dat$x that match my patterns, then combine them into a range and delete that range from the text. E.g. I add columns to my dat data frame like start_pattern_1 and end_pattern_1 and so on. So for the first row 1 I get 9 (start) and 18 (end) for the first pattern, 6/10 for the second pattern. I then need to check if any end position overlaps with any start position (here start 9 and end 10) and combine them into a range 6-18 and remove this range from the text.

Problem is that I potentially have many new start/end columns then (could be a few hundred patterns in my case) and if I need to pairwise compare the overlapping ranges, my computer will probably crash.

So I'm wondering how I could get it work or how I should best approach this solution. Maybe (and I hope so) there's a better/more elegant/easy solution.

Desired Output of dat would be:

x                                    some_other_cols    short_x
this is my example text              1                  this text
and here is my other text example    2                  and here example
my other text is short               2                  is short

Appreciate your help! Thanks.

Jaap
  • 81,064
  • 34
  • 182
  • 193
deschen
  • 10,012
  • 3
  • 27
  • 50
  • 2
    How do you get `this text`? If *my example* is eliminated as per your first pattern, then you are left with `this is text`which does not have the word *my*...therefore how is *is* eliminated?? – Sotos Jan 28 '20 at 09:32
  • Do you have control over the pattern ? I.e: can it be made some other way to achieve what you want ? – Tensibai Jan 28 '20 at 09:49
  • Or the 1st value needs to be only "this" and 2nd "and here" ? Can you explain your expected output? – Ronak Shah Jan 28 '20 at 09:50
  • 2
    I like this question. It seems to come down to another question: "How can one paste strings together by overlap?". If you can do that, I think you can solve this by creating the concatenated-by-overlap string and adding them to the pattern vector. (so that would have `c("is my example", "is my other text", "my example", "is my", "my other text")`. – Georgery Jan 28 '20 at 10:02
  • 1
    @Sotos: this is exactly my problem, I don't want to go over my dat$x texts sequentially, i.e. I do NOT want to remove the first pattern and then the second pattern from the reamining text. Instead I want to check which patterns occur in the original text, check if there's overlap and if so, remove the "combined" pattern. – deschen Jan 28 '20 at 10:23
  • @Tensibai: no, the pattern is extracted from the text. To be precise: I create trigrams from the text (any three-word text combinations) and want to remove all text occurences that are part of any of these trigrams. – deschen Jan 28 '20 at 10:24
  • @RonakShah: So patterns "my example" and "is my" are part of the first text row. But they overlap. "my" is part of both text patterns. So what I really want, is to remove "is my example", leading to "this text" as the remaining part of the text. – deschen Jan 28 '20 at 10:27
  • Problem with that is how do you tell what order to concatenate? For example, If the first two patterns are TRUE, then how do you concatenate as `is my example` and **NOT** like `my example is`? – Sotos Jan 28 '20 at 10:27
  • No, that won't work, because the overlap can only happen by comparing the end of a string with the beginning of another string. E.g. "my" is the end of "is my" and the beginning of "my example", so that's where you would want to glue them together. The other way around won't work. – deschen Jan 28 '20 at 10:31
  • Will be super complex, but a starting point could be `matches <- lapply(my_patterns,function(x) regexpr(x,dat$x))`, next you'll have to compute start and end for each match (start+length) and see if they overlap to create a single "range" to remove from the string – Tensibai Jan 28 '20 at 10:46
  • Phew. Yep, it's exploding complexity-wise. I started with your suggestion and get as far as appending the start and end positions of each pattern as columns to my data frame. But now I would need to loop through all the start/end columns and do my pairwise comparisons to check if any start position of a pattern lies within the [start, end] range of another pattern and then I would need to extend this old [start, end] range with the new end point from that pattern (if it is larger than the already existing endpoint]. I probably have to give up at this point, since this is beyond my R pay grade. – deschen Jan 28 '20 at 13:30
  • @deschen I'm still playing a bit with it, I'm using two nested lapply calls which already "smells", but there's probably something doable on this side – Tensibai Jan 28 '20 at 13:39
  • @deschen `stringr::str_locate_all()` returns start and end positions of all matches. Overlapping positions [can be collapsed by clever grouping](https://stackoverflow.com/a/52092801/3817004). However, there is another challenge where multiple non-overlapping patterns occur in a string. – Uwe Jan 28 '20 at 18:27

2 Answers2

6

New option with str_locate_all mentionned by Uwe in a comment under the question which greatly simplify the code:

library(stringr)
# Create function to remove matching part of text
# First argument is text, second argument is a list of start and length 
remove_matching_parts <- function(text, positions) {
  if (nrow(positions) == 0) return(text)
  ret <- strsplit(text,"")[[1]]
  lapply(1:nrow(positions), function(x) { ret[ positions[x,1]:positions[x,2] ] <<- NA } )
  paste0(ret[!is.na(ret)],separator="",collapse="")
}

# Loop over the data to apply the pattern
# row = length of vector, columns = length of pattern
matches <- lapply(dat$x, function(x) {
  do.call(rbind,str_locate_all(x, my_patterns)) # transform the list output of str_locate in a table of start/end
})

# Avoid growing a vector in a for loop, create it beforehand, it will be the same length as teh vector we work against
dat$result <- vector("character",length(dat$x))
# Loop on each value to remove the matching parts
for (i in 1:length(dat$x)) {
 dat$result[i] <- remove_matching_parts(as.character(dat$x[i]),matches[[i]])
}

If you have control over the pattern definition and can create it by hand then it can be achieved with a regex solution:

> gsub("(is )?my (other text|example)?","",dat$x)
[1] "this  text"        "and here  example" " is short" 

The idea is to create the pattern with optional parts (the ? after the grouping parentheses.

So we have roughly:

  • (is )? <= optional "is" followed by space
  • my <= literal "my" followed by space
  • (other text|example)? <= Optional text after "my ", either "other text" or (the |) "example"

If you don't have control, things gets messy, I hope I've commented enough for it to be understandable, according to the number of loops included don't expect it to be quick:

# Given datas
dat <- data.frame(x               = c("this is my example text", "and here is my other text example", "my other text is short","yet another text"),
                some_other_cols = c(1, 2, 2, 4))

my_patterns <- c("my example", "is my", "my other text")

# Create function to remove matching part of text
# First argument is text, second argument is a list of start and length 
remove_matching_parts <- function(text, positions) {
  ret <- strsplit(text,"")[[1]]
  lapply(positions, function(x) { ifelse(is.na(x),,ret[ x[1]:x[2] ] <<- NA ) } )
  paste0(ret[!is.na(ret)],separator="",collapse="")
}

# Create the matches between a vector and a pattern
# First argument is the pattern to match, second is the vector of charcaters
match_pat_to_vector <- function(pattern,vector) {
  sapply(regexec(pattern,vector), 
         function(x) {
           if(x>-1) { 
             c(start=as.numeric(x), end=as.numeric(x+attr(x,"match.length")) ) # Create a start/end vector from the index and length of the match
           }
         })
}

# Loop over the patterns to create a dataframe of matches
# row = length of vector, columns = length of pattern
matches <- sapply(my_patterns,match_pat_to_vector,vector=dat$x)

# Avoid growing a vector in a for loop, create it beforehand, it will be the same length as teh vector we work against
dat$result <- vector("character",length(dat$x))
# Loop on each value to remove the matching parts
for (i in 1:length(dat$x)) {
 dat$result[i] <- remove_matching_parts(as.character(dat$x[i]),matches[i,])
}

Result after run:

> dat
                                  x some_other_cols           result
1           this is my example text               1        this text
2 and here is my other text example               2 and here example
3            my other text is short               2         is short
4                  yet another text               4 yet another text
Tensibai
  • 15,557
  • 1
  • 37
  • 57
  • Hm, that works if you can eyeball the overlap and then hard-code it. But what if not? – Georgery Jan 28 '20 at 09:58
  • @Georgery yep, I forgot a "if yes" in my first sentence, it's not a silver bullet for the question, it's just an approach. If the patterns are random that become more tricky and would require working differently with a loop probably – Tensibai Jan 28 '20 at 10:07
  • Exactly, hard coding is not an option, since the patterns can change in ym real-world scenario. And I have much more and much more different patterns. – deschen Jan 28 '20 at 10:28
  • @Georgery taker of review on the updated solution :) – Tensibai Jan 28 '20 at 16:37
  • Impressive. Not that I understand all parts of it, but it gets me already 95% close to the solution. However, @Uwe is right (I've not yet thought of this problem) that it does not work perfectly when any pattern appears more than once in a text (which indeed can happen in my real-life use-case). But thanks anyway, really appreciate the effort and I learned quite a bit today. – deschen Jan 28 '20 at 20:24
  • @Tensibai, I have tested your approach with my enlarged version of `dat` and noticed that _This is myself_ becomes _This elf_. Is it possible you are removing an additional character after the pattern even if it is not a whitespace? – Uwe Jan 29 '20 at 07:24
  • @Uwe sorry, misread the problem, potentially the way I did the range "replacement" could be the culprit, start+length may end up 1 char too far. Handling repeated patterns doesn't work neither indeed. I'm unsure it worth being fixed now – Tensibai Jan 29 '20 at 09:42
  • @Uwe finally, I used str_locate_all (thanks for the hint in the comments under the question) which reduce the complexity and gives correct results – Tensibai Jan 29 '20 at 10:07
  • @Tensibai thanks again for the effort. I have some problems with the `matches <- str_locate_all`, where in my real-life example the mathces list is empty, although the dat$x is a character column and my_patterns is a character vector (dat has 14k rows), my_patterns 46 elements. Not sure why that is and I need to dig deeper into the issue. With the example posted here it indeed does work. FYI: @Uwe's updated solution still works. So not sure why your str_locate_all does not work as intended. – deschen Jan 29 '20 at 14:59
  • @deschen I had the problem with one of the example from Uwe answer (the 4th) which is empty, hence the modified function to return if there's no matches. I had a doubt about your real data, data.frame makes factors by default but I don't have a problem with a direct character column neither... matches should be a list of 3 element list (start/end), empty oif there's no pattern matching a specific row from dat – Tensibai Jan 29 '20 at 15:24
  • I checked what's going on, but am unable to detect why there's a problem with the str_locate_all function. I guess it's due to the fact that you didn't wrap it in the lapply function like @Uwe did. But I'm not sure. I tried with some other data sets and the problem still exists. In addition, if you take the extended toy data set Uwe created, you'll see that the function doesn't correctly delete the occurences of "my example" in rows 6, 8 and 9. – deschen Jan 29 '20 at 20:03
  • @deschen indeed there's something strange going on with str_locate_all, i.e `str_locate_all(dat$x[8], my_patterns)` return the expected matches but `str_locate_all(dat$x, my_patterns)[8]` doesn't. I'll check if I can workaround it – Tensibai Jan 30 '20 at 08:38
  • @deschen, ok I found why, str_locate_all test first parttern against first row, second pattern against second row, and recycle the pattern (4th row get first pattern). I'll fix the problem – Tensibai Jan 30 '20 at 08:43
  • Thanks. Not sure, but it might be as easy as changing to the mentioned lapply extension? `lapply(my_patterns, function(x) str_locate_all(dat$x, x))` – deschen Jan 30 '20 at 09:05
  • @deschen I'm trying to find an approach where I wouldn't have to recompute all intervals after, something along binding the list output from str_locate_all – Tensibai Jan 30 '20 at 09:07
  • @deschen I finally found a way with do.call(rbind) around str_locate_all within a lapply call. I think it works for all cases now – Tensibai Jan 30 '20 at 10:38
  • Wow. This works smoothly and is about 10-20 times faster than Uwe's data.table solution. And this is quite relevant in my work. I tested with a text corpus of 14k elements against 46 patterns to match. However, I regularly have texts with 50k text and maybe 100 or 200 or more patterns. I hope, the computation time increases linearly and not exponentially. Thanks for not giving up and fixing the last remaining bugs. – deschen Jan 30 '20 at 21:31
  • @deschen The time spent will be higher if there's a lot of matches per data, the metric here is the total number of matches which is number of data * number of time each patterns match. Can't guarantee it will be linear – Tensibai Jan 31 '20 at 09:22
  • Well if it is mainly based on the number of matches, then it should be more or less linear (in layman's understanding, maybe a mathematician or IT pro has a different answer). It's not that the amount of computations per additional row or additional pattern increases exponentially (like you would have if you are doing pairwise comparisons), so I'm relatively confident that it will still run in some reasonable time with 50k or so cases. – deschen Jan 31 '20 at 09:55
  • @deschen it's not linear in the way that if you add a new pattern that matches 10 times per row, the number increase by nrow*10 and not just by nrow. It's not really predictable, you should have a better gut feeling knowing your datas :) Just don't add "a" alone as a pattern ? – Tensibai Jan 31 '20 at 09:58
6

There are two crucial points here:

  1. The patterns to remove from a string may overlap
  2. There may be multiple non-overlapping patterns to remove from the string

The solution below tries to address both issues using my favorite tools

library(data.table)
setDT(dat)[, rn := .I] # add row numbers to join on later

library(stringr)
library(magrittr) # piping used to improve readability

pos <- 
  # find start and end positions for each pattern
  lapply(my_patterns, function(pat) str_locate_all(dat$x, pat) %>% 
           lapply(as.data.table) %>% 
           rbindlist(idcol = "rn")) %>% 
  rbindlist() %>% 
  # collapse overlapping positions
  setorder(rn, start, end) %>% 
  .[, grp := cumsum(cummax(shift(end, fill = 0)) < start), by = rn] %>% 
  .[, .(start = min(start), end = max(end)), by = .(rn, grp)]

Now, pos has become:

    rn grp start end
 1:  1   1     6  18
 2:  2   1    10  25
 3:  3   1     1  13
 4:  5   1     6  10
 5:  5   2    24  28
 6:  6   1     1  13
 7:  6   2    15  27
 8:  7   1     3   7
 9:  8   1     1  10
10:  8   2    12  16
11:  8   3    22  34
12:  9   1     1  10
13:  9   2    19  31
# remove patterns from strings from back to front
dat[, short_x := x]
for (g in rev(seq_len(max(pos$grp)))) {
  # update join 
  dat[pos[grp == g], on = .(rn), short_x := `str_sub<-`(short_x, start, end, value = "")]
}
dat[, rn := NULL][   #remove row number
  , short_x := str_squish(short_x)][]   # remove whitespace 
                                             x some_other_cols                          short_x
1:                     this is my example text               1                        this text
2:           and here is my other text example               2                 and here example
3:                      my other text is short               2                         is short
4:                            yet another text               4                 yet another text
5: this is my text where 'is my' appears twice               5 this text where '' appears twice
6:                 my other text is my example               6                                 
7:                                 This myself               7                           Thself
8:          my example is my not my other text               8                              not
9:             my example is not my other text               9                           is not

The code to collapse overlapping positions is modified from this answer.

The intermediate result

lapply(my_patterns, function(pat) str_locate_all(dat$x, pat) %>% 
           lapply(as.data.table) %>% 
           rbindlist(idcol = "rn"))
[[1]]
   rn start end
1:  1     9  18
2:  6    18  27
3:  8     1  10
4:  9     1  10

[[2]]
   rn start end
1:  1     6  10
2:  2    10  14
3:  5     6  10
4:  5    24  28
5:  6    15  19
6:  7     3   7
7:  8    12  16

[[3]]
   rn start end
1:  2    13  25
2:  3     1  13
3:  6     1  13
4:  8    22  34
5:  9    19  31

shows that patterns 1 and 2 overlap in row 1 and patterns 2 and 3 overlap in row 2. Rows 5, 8, and 9 have non-overlapping patterns. Row 7 is to show that patterns are extracted regardless of word boundaries.

EDIT: dplyr version

The OP has mentioned that he/she has "successfully avoided data.table so far". So, I felt challenged to add a dplyr version:

library(dplyr)
library(stringr)

pos <- 
  # find start end end positions for each pattern
  lapply(my_patterns, function(pat) str_locate_all(dat$x, pat) %>% 
           lapply(as_tibble) %>% 
           bind_rows(.id = "rn")) %>% 
  bind_rows() %>% 
  # collapse overlapping positions
  arrange(rn, start, end) %>% 
  group_by(rn) %>% 
  mutate(grp = cumsum(cummax(lag(end, default = 0)) < start)) %>% 
  group_by(rn, grp) %>% 
  summarize(start = min(start), end = max(end))
# remove patterns from strings from back to front
dat <- dat %>%
  mutate(rn = row_number() %>% as.character(),
         short_x = x %>% as.character())
for (g in rev(seq_len(max(pos$grp)))) {
  dat <- dat %>% 
    left_join(pos %>% filter(grp == g), by = "rn") %>% 
    mutate(short_x = ifelse(is.na(grp), short_x, `str_sub<-`(short_x, start, end, value = ""))) %>% 
    select(-grp, -start, -end)
}
# remove row number
dat %>% 
  select(-rn) %>% 
  mutate(short_x = str_squish(short_x))
                                            x some_other_cols                          short_x
1                     this is my example text               1                        this text
2           and here is my other text example               2                 and here example
3                      my other text is short               2                         is short
4                            yet another text               4                 yet another text
5 this is my text where 'is my' appears twice               5 this text where '' appears twice
6                 my other text is my example               6                                 
7                              This is myself               7                        This self
8          my example is my not my other text               8                              not
9             my example is not my other text               9                           is not

The algorithm is essentially the same. However, there are two challenges here where dplyr differs from data.table:

  • dplyr requires explicit coersion from factor to character
  • there is no update join available in dplyr, so the for loop has become more verbose than the data.table counterpart (Perhaps, someone knows a fancy purrr function or a map-reduce trick to accomplish the same?)

EDIT 2

There are some bug fixes and improvements to above codes:

  1. Collapsing positions has been corrected to work also for some edge case I have added to dat.
  2. seq() has been replaced by seq_len().
  3. str_squish() reduces repeated whitespace inside a string and removes whitespace from start and end of a string.

Data

I have added some use cases to test for non-overlapping patterns and complete removal, e.g.:

dat <- data.frame(
  x = c(
    "this is my example text",
    "and here is my other text example",
    "my other text is short",
    "yet another text",
    "this is my text where 'is my' appears twice",
    "my other text is my example",
    "This myself",
    "my example is my not my other text",
    "my example is not my other text"
  ),
  some_other_cols = c(1, 2, 2, 4, 5, 6, 7, 8, 9)
)
my_patterns <- c("my example", "is my", "my other text")
Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134
  • Not that I understand anything of the code (I successfully avoided data.table so far), but it works nicely. Will check tomorrow how it performs with my much larger real data set (where dat contains a few thousand rows and my_patterns contains 100 or 200 or so patterns). – deschen Jan 28 '20 at 20:28
  • From first glance works smoothly. With my 14k data set and comparing with 46 patterns it runs 1-2 mins, but I did not yet found an issues. Just tried the data table solution, now running the tidyverse approach. – deschen Jan 29 '20 at 15:00
  • Just FYI, the data.table solution performs much faster than the tidyverse approach. With the 9 rows in the toy data set, both are still in the same range, but if I extend the data set to e.g. 900 rows, data.table is twice as fast. The advantage might even get bigger with larger data sets or more patterns. – deschen Jan 29 '20 at 21:02