I need help in removing text using R.

Below is the file that I have:

Name       Content
Re: fff    . Re: fff . I am a boy. She is girl...
GGOO       Laagg. jaja.
Re: QWE    . Re: QWE . I am pretty.

I would like to convert the file into the output below:

Name       Content
Re: fff    I am a boy. She is girl...
GGOO       Laagg. jaja.
Re: QWE    I am pretty.

Basically, I want to remove text from the Content column wherever it matches the text in the Name column of the same row, using R.

I tried using gsub but it doesn't work. Below is the code I tried:

r <- gsub (df$Name, "", df$Content)

Thank you in advance.

poppp
  • Take a look at this: http://stackoverflow.com/questions/19424709/r-gsub-pattern-vector-and-replacement-vector – scoa Sep 10 '15 at 08:00
  • What do you want to do with the spaces and dots around your pattern? – David Arenburg Sep 10 '15 at 08:03
  • @DavidArenburg I would like to remove them, but would like to keep the dots for the sentences after extracting the title in Content column – poppp Sep 10 '15 at 08:13
  • If any of the answers was useful for you, you should probably upvote/accept in order to provide some reward/feedback for the effort of all the people in this thread. – David Arenburg Sep 17 '15 at 07:00

4 Answers

2

This worked for me:

df$Result <- mapply(gsub, pattern = df$Name, replacement = "", x = df$Content)

The problem with gsub is that it accepts only one pattern: given a vector of patterns, it uses the first element and warns about the rest. To apply a separate pattern to each element of x, `mapply` is the tool of choice in base R.
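To make the difference concrete, here is a small sketch comparing the two calls. The data frame is rebuilt from the table in the question (its exact structure is an assumption), and `USE.NAMES = FALSE` is added only to keep the result unnamed.

```r
# Sample data reconstructed from the question (assumed structure)
df <- data.frame(
  Name = c("Re: fff", "GGOO", "Re: QWE"),
  Content = c(". Re: fff . I am a boy. She is girl...",
              "Laagg. jaja.",
              ". Re: QWE . I am pretty."),
  stringsAsFactors = FALSE
)

# gsub() with a pattern vector: only the first pattern ("Re: fff") is used,
# and R warns about it, so row 3 is left untouched
first_only <- suppressWarnings(gsub(df$Name, "", df$Content))

# mapply() pairs the i-th pattern with the i-th string
pairwise <- mapply(gsub, pattern = df$Name, replacement = "", x = df$Content,
                   USE.NAMES = FALSE)

print(first_only)
print(pairwise)
```

Note that each Name is still interpreted as a regular expression here, so names containing characters such as `.` or `(` may need escaping or `fixed = TRUE`.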

Kirill
1

You can also use the stringi package, which has a vectorized and very efficient stri_replace_first_fixed function for that:

library(stringi)
stri_replace_first_fixed(df$Content, df$Name, "")
## [1] ".  . I am a boy. She is girl..." "Laagg. jaja."  ".  . I am pretty."  

Edit: As per the OP's comment, if the name may be surrounded by dots and spaces, you need to build a regular expression (similar to the one in the other answer) and use stri_replace_first_regex instead:

stri_replace_first_regex(df$Content, paste0("(\\.\\s+)?", df$Name, "(\\s+\\.\\s+)?"), "")
## [1] "I am a boy. She is girl..." "Laagg. jaja." "I am pretty."              
David Arenburg
1

Data

d <- structure(list(Name = c("Re: fff", "GGOO", "Re: QWE"), 
                   Content = c(". Re: fff . I am a boy. She is girl...",
                               "Laagg. jaja.", ". Re: QWE . I am pretty.")),
              .Names = c("Name", "Content"), 
              row.names = c(NA, -3L), class = "data.frame")

Code

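apply(d, 1, function(row) {
   # row[1] is the Name, row[2] the Content of the current row
   reg <- row[1]
   # match: dot, optional spaces, the name, optional spaces, dot, optional spaces
   reg <- paste("\\.[[:space:]]*", reg, 
                "[[:space:]]*\\.[[:space:]]*", sep = "")
   gsub(reg, "", row[2])
})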

# [1] "I am a boy. She is girl..." "Laagg. jaja."              
# [3] "I am pretty." 

Explanation

gsub is not vectorized over its pattern argument, that is, it cannot apply a different pattern to each element of x. Hence, you have to loop over the rows of your data frame. I amended the regex so that it also captures the dot and the spaces.
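The same row-wise loop can also be written over row indices, which avoids the coercion of the data frame to a character matrix that `apply` performs. This is only a sketch: `strip_name` is a hypothetical helper name, and it assumes, like the code above, that the name in Content is enclosed in dots.

```r
d <- data.frame(
  Name = c("Re: fff", "GGOO", "Re: QWE"),
  Content = c(". Re: fff . I am a boy. She is girl...",
              "Laagg. jaja.",
              ". Re: QWE . I am pretty."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: remove ". <name> ." (with flexible spacing) from one string
strip_name <- function(name, content) {
  reg <- paste0("\\.[[:space:]]*", name, "[[:space:]]*\\.[[:space:]]*")
  sub(reg, "", content)
}

# Loop over row indices; vapply() guarantees a character vector of length nrow(d)
cleaned <- vapply(seq_len(nrow(d)),
                  function(i) strip_name(d$Name[i], d$Content[i]),
                  character(1))

print(cleaned)
```

Rows whose Content does not contain the dot-enclosed name, such as the second one, are returned unchanged.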

thothal
  • This could be a nice solution for the additional spaces, but it won't work when there aren't any. For example, if you replace `Name` in the second row with `Laagg.`, it won't match. – David Arenburg Sep 10 '15 at 08:08
1

Here's another option, using Map and gsub:

df$Content <- Map(gsub, df$Name, "", df$Content)
#     Name                         Content
#1 Re: fff .  . I am a boy. She is girl...
#2    GGOO                    Laagg. jaja.
#3 Re: QWE               .  . I am pretty.

Considering that the names seem to be always enclosed by a leading and ending period, separated with a single white space, and that the OP stated that these periods should be removed, the result could be improved with:

df$Content <- Map(gsub, paste(".", df$Name, "."), "", df$Content)
#     Name                     Content
#1 Re: fff  I am a boy. She is girl...
#2    GGOO                Laagg. jaja.
#3 Re: QWE                I am pretty.

However, this only works for patterns of the type ". name ."
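Two details of the call above are worth illustrating: `Map` returns a list, and the `.` in the pattern is a regex wildcard rather than a literal dot. The sketch below is one possible variation, not part of the original answer: it uses `mapply` with `fixed = TRUE` to get a plain character vector from a literal match, and `trimws` to drop the leftover leading space. The data frame is rebuilt from the question.

```r
df <- data.frame(
  Name = c("Re: fff", "GGOO", "Re: QWE"),
  Content = c(". Re: fff . I am a boy. She is girl...",
              "Laagg. jaja.",
              ". Re: QWE . I am pretty."),
  stringsAsFactors = FALSE
)

# Map() returns a list with one element per row
as_list <- Map(gsub, paste(".", df$Name, "."), "", df$Content)

# fixed = TRUE treats ". Re: fff ." as literal text, so "." matches only a dot;
# mapply() simplifies the result to a character vector
cleaned <- mapply(gsub, paste(".", df$Name, "."), "", df$Content,
                  MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)
cleaned <- trimws(cleaned)  # remove the space left after the deleted prefix

# As a regex, "." matches any character, so this unintended string matches too
regex_hit <- gsub(". GGOO .", "", "a GGOO b")
fixed_hit <- gsub(". GGOO .", "", "a GGOO b", fixed = TRUE)

print(cleaned)
```

If the list column from `Map` is preferred for other reasons, `unlist()` on the result gives the same character vector.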

RHertel
  • `Map` is the same as `mapply` with `SIMPLIFY = FALSE`. It means that your solution will return a list column, which will be harder to use afterwards (just saying- no intention to fight over this :)) – David Arenburg Sep 10 '15 at 08:21
  • Yes, it returns a list. If necessary, the column can be treated with `unlist()` afterwards, or use `mapply` as you suggest. There are often similar solutions. This, as I said, is just one option. I don't claim that it's better, faster, or superior to other solutions in any way. – RHertel Sep 10 '15 at 08:29
  • I think it's useful to know about different possibilities. For example, there are many similar ways to accomplish what `aggregate()` can do in some cases, and it may help to be aware of these different options. – RHertel Sep 10 '15 at 08:35
  • OK, this will be my last comment on this. But if you look at the source code of `Map`, it is just `mapply`, not something else. Your solution, although it prints nicely, gives a worse result than plain `mapply` because it returns a list. If the other guy had posted a `Map` solution and you had posted `mapply` as an alternative, I could agree, as it would have been an improvement, though IMO it should be a comment as they are the same. This is nothing like comparing `aggregate` and `tapply`, which are completely different – David Arenburg Sep 10 '15 at 08:42
  • There are different opinions on this. Here's a quote from H. Wickham's book "Advanced R": "You may be more familiar with mapply() than Map(). I prefer Map() because: • It’s equivalent to mapply with simplify = FALSE , which is almost always what you want. • Instead of using an anonymous function to provide constant inputs, mapply has the MoreArgs argument that takes a list of extra arguments that will be supplied, as is, to each call. This breaks R’s usual lazy evaluation semantics, and is inconsistent with other functions. In brief, mapply() adds more complication for little gain."(end quote) – RHertel Sep 10 '15 at 08:56