I have a dataset of constructive comments and, at an early stage of analysis, want to remove a list of common positive comments that is stored in a CSV file.

The original dataset looks similar to this:

  df <-
  data.frame(
    "SuveyID" = 1:10,
    "NI" = c(
      "too many quizs",
      "very vague and conflicting instructions sometimes",
      "way too many emails hard to keep up",
      "technology issue",
      "all is good",
      "all perfect",
      "no improvements",
      "sometimes goes off topic",
      "connection issues of internet",
      "all is well"
    )
  )

The list I need to remove looks similar to this. Importantly, this list comes from a CSV:

remove <-
  data.frame(
    "Strings.to.replace.with.NA" = c(
      "all is good", 
      "all is well", 
      "all perfect")
    )

Where a string in the remove dataset appears in the NI dataset, I would like to replace it with NA.

The problem I appear to be having is with collapse = "|" across the records in the CSV; I can't seem to get it to work. I have tried multiple versions of str_replace_all, str_replace, and stri_detect_regex, but I don't have the pattern correct with collapse = "|".

Help is greatly appreciated as always.

Keelin
  • Do you just need `df$NI[df$NI %in% remove$Strings.to.replace.with.NA] <- NA`? See [R demo](https://rextester.com/HGD72227). See [this answer](https://stackoverflow.com/questions/32239581/replacing-values-in-a-column-with-another-column-r) – Wiktor Stribiżew Apr 30 '20 at 21:13
  • Yes this appears to work wonderfully!!! Gosh that was fast and greatly appreciated @WiktorStribiżew – Keelin Apr 30 '20 at 21:21
  • My solution and akrun's cannot both work for you; only one can really be what you need. Please let me know the answer to my question above. – Wiktor Stribiżew Apr 30 '20 at 21:32

1 Answer

We can concatenate the 'remove' elements into a single string using paste with collapse="|" and use that as the pattern in gsub (base R):

df$NI <- gsub(paste0("\\b(", paste(remove[[1]], collapse="|"), ")\\b"), "", df$NI)
df$NI
#[1] "too many quizs"                                    "very vague and conflicting instructions sometimes"
#[3] "way too many emails hard to keep up"               "technology issue"                                 
#[5] ""                                                  ""                                                 
#[7] "no improvements"                                   "sometimes goes off topic"                         
#[9] "connection issues of internet"                     ""                        
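Note that the output above contains empty strings rather than NA, and that `\\b` would also strip a listed phrase from inside a longer comment. If the goal is to set a comment to NA only when it consists entirely of a listed phrase, as the question describes, a minimal sketch could look like the following. The CSV is simulated with `read.csv(text = ...)`, and the names `ni`, `pattern`, `ni_regex` and `ni_exact` are illustrative assumptions, not part of the original data.

```r
# Simulate the CSV of phrases; read.csv() turns the header's spaces into dots
remove <- read.csv(
  text = "Strings to replace with NA\nall is good\nall is well\nall perfect\n",
  stringsAsFactors = FALSE
)

# Abbreviated stand-in for df$NI, plus one comment that merely contains a phrase
ni <- c("too many quizs", "all is good", "all perfect",
        "no improvements", "all is well", "all is well but slow")

# Anchor with ^ and $ so only whole-comment matches count
pattern <- paste0("^(", paste(remove[[1]], collapse = "|"), ")$")

# grepl() gives a logical vector that selects the elements to overwrite with NA
ni_regex <- ni
ni_regex[grepl(pattern, ni_regex)] <- NA

# Exact matching without a regex, as suggested in the comments on the question
ni_exact <- ni
ni_exact[ni_exact %in% remove[[1]]] <- NA
```

Both vectors end up with NA in the three positive comments and leave "all is well but slow" untouched. The `%in%` form avoids regex escaping issues if the CSV phrases ever contain characters such as `.` or `(`.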

Or using str_remove_all with str_c (this returns the cleaned vector without modifying df):

library(stringr)
str_remove_all(df$NI, str_c("\\b(", str_c(remove[[1]], collapse="|"), ")\\b"))
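If the comments may differ from the CSV phrases in letter case or surrounding whitespace, a stringr variant that yields NA instead of an empty string might look like this. It is only a sketch under those assumptions; `ni`, `phrases`, `is_positive` and `ni_clean` are hypothetical names, and the phrases are assumed to contain no regex metacharacters.

```r
library(stringr)

# Hypothetical comments with inconsistent case and trailing whitespace
ni <- c("All is good ", "too many quizs", "all perfect", "not all is well")
phrases <- c("all is good", "all is well", "all perfect")

# Allow optional whitespace around the phrase, but require a whole-string match
pattern <- str_c("^\\s*(", str_c(phrases, collapse = "|"), ")\\s*$")

# regex(ignore_case = TRUE) makes the comparison case-insensitive
is_positive <- str_detect(ni, regex(pattern, ignore_case = TRUE))

# replace() returns a copy with the flagged comments set to NA
ni_clean <- replace(ni, is_positive, NA_character_)
```

Here "not all is well" is kept because the anchors require the phrase to be the entire comment.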
akrun