
I am working on a primitive speech analysis algorithm and want to improve how it handles negations of positive/negative statements. At the moment I add the string "NOT_" only to the word that directly follows the negation:

s_commentsOut$gsubContent <- gsub("not ","not NOT_",gsub("n't ","n't NOT_",s_commentsOut$lowCo))

So for example

"This is not good"

becomes

"This is not NOT_good"

Now I also want "NOT_" to be added when there are up to n characters between the negation and a word from a vector of target words, e.g.:

targetList <- c("nice", "perfect", "good", "love")

Now with the help of the above list, the following string:

"This isn't a very good way"

should become

"This isn't a very NOT_good way"

This replacement should only take place if the negation occurs at most n (for instance 15) characters before the target. For example, the following should not be converted, because the distance between the negation and the target is > 15:

"This is not going to work. However you did this very nicely."

I found the following SO questions:

Negation of several characters before pattern

How to replace a character in a string but only if it occurs within a delimited substring?

But I struggle to get it right. As a workaround, I currently remove strings like "like ", "an ", "a " from the text...

Further test phrases:

"Nottingham is the love of my life."

"This is good. Nottingham is a town."

"This is not very good"

"This is not good. This is not good. This is not very good. This is nice. This very nice. This is not very nice."

florian
  • where does the n character thing come in? `ifelse(grepl('not|n\'t', x), gsub(sprintf("(?=%s)", paste(targetList, collapse = '|')), "NOT_", x, perl = TRUE), x)` – rawr Oct 28 '16 at 17:16
  • Thank you @rawr - I edited the above post to make it more clear. – florian Oct 28 '16 at 18:12

2 Answers

1

This should work (updated with n)

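library(stringr)

# x: a single string; n: maximum allowed distance in characters
negation <- function(x, n) {
  target <- c("nice", "perfect", "good")
  negate <- c("not ", "n't")
  out <- x
  # first match of each negation pattern (one row per pattern)
  a <- as.data.frame(str_locate(x, negate))
  negate_end <- as.numeric(a[!is.na(a$end), ]$end)
  # first match of each target word (one row per word)
  b <- as.data.frame(str_locate(x, target))
  target_start <- as.numeric(b[!is.na(b$start), ]$start)
  # only the first distance is used; 9999999 means "nothing to compare"
  distance <- target_start - negate_end
  distance <- ifelse(length(distance) == 0, 9999999, distance)
  if (sum(!is.na(str_match(x, target))) > 0 & distance <= n & distance >= 0)
    out <- str_replace_all(x, target, paste("NOT_", target, sep = ""))[which(!is.na(str_match(x, target)))]
  return(out)
}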
  • where do you define `n`? With `n` I mean the amount of letters a good would be negated in the presence of a negation like `not` or `n't`. – florian Oct 28 '16 at 17:52
  • Sorry florian for missing n earlier..check it now it should work – Vishal Jaiswal Oct 29 '16 at 09:30
  • Thank you we are getting close, but we are not quite there yet. Atm it seems to take into account negations in 15 characters vicinity, however it should only account for preceding negations. Also words like Nottingham should not trigger it: Test: `negation("This is good. Nottingham is a town.", 15)` returns: `"This is NOT_good. Nottingham is a town."` – florian Oct 30 '16 at 17:02
  • Thanks Florian. It was actually not matching the not in Nottingham; rather, it was negating even when no negation word was present. I have corrected the script to take care of that now. Also, if you could provide a few more test inputs and outputs, I can fine-tune it (in case the updated script does not work) – Vishal Jaiswal Oct 31 '16 at 09:52
  • Thank you...hmm now I try to apply it to my vector with 500k text elements and I receive the following warning: `s_commentsOut$gsubContent <- lapply(s_commentsOut$lowCo, function(x) { negation(x, 15) })` `Warning messages: 1: In target_start - negate_end : longer object length is not a multiple of shorter object length` – florian Nov 01 '16 at 10:15
  • This seems to happen only in a few cases, as otherwise the algorithm seems to work just fine – florian Nov 01 '16 at 10:16
  • ahh found another bug..try it with the following test phrase: `negation("This is not good. This is not good. This is not very good. This is nice. This very nice. This is not very nice.", 15)` – florian Nov 01 '16 at 10:23
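
The points raised in these comments (only preceding negations count, every occurrence should be handled, and no length-mismatch warning on long vectors) could be addressed by comparing all match positions at once. The following is a minimal base R sketch, not the answerer's code. The function name `negate_targets`, the negation pattern (a standalone "not" or a word ending in "n't") and the definition of distance as the number of characters strictly between the negation and the target are assumptions made for illustration.

```r
# Hypothetical helper: prefix a target word with "NOT_" when a negation
# ends at most n characters before the start of that target word.
negate_targets <- function(x, targets, n = 15) {
  # assumed negation pattern: standalone "not" or a word ending in "n't"
  neg_pat <- "\\b(?:not|\\w*n't)\\b"
  # targets are matched at a word start, so "nicely" counts as "nice"
  tgt_pat <- paste0("\\b(?:", paste(targets, collapse = "|"), ")")
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)
    neg <- gregexpr(neg_pat, s, perl = TRUE, ignore.case = TRUE)[[1]]
    tgt <- gregexpr(tgt_pat, s, perl = TRUE, ignore.case = TRUE)[[1]]
    if (neg[1] == -1 || tgt[1] == -1) return(s)
    neg_end <- as.integer(neg) + attr(neg, "match.length") - 1L
    tgt_start <- as.integer(tgt)
    # a target is negated if some negation ends 0..n characters before it
    hit <- vapply(tgt_start, function(p) {
      gap <- p - neg_end - 1L
      any(gap >= 0L & gap <= n)
    }, logical(1))
    # insert from right to left so earlier positions stay valid
    for (p in rev(tgt_start[hit])) {
      s <- paste0(substr(s, 1, p - 1L), "NOT_", substr(s, p, nchar(s)))
    }
    s
  }, character(1), USE.NAMES = FALSE)
}

targetList <- c("nice", "perfect", "good", "love")
negate_targets("This isn't a very good way", targetList, 15)
```

Because the function is vectorised over `x`, it can be applied to a whole column, for example `negate_targets(s_commentsOut$lowCo, targetList, 15)`. Note that a purely character-based distance can still reach across a sentence boundary when the two sentences are short.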
0

You could try the following: (please do test because I am not 100% sure)

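library(stringr)

# word: a single target word; phrase: a single string
negate <- function(word, phrase, distance_allowed) {
  # first standalone "not" (start of phrase, between words, or end of phrase)
  not_pos <- str_locate(tolower(phrase), "^not |not$| not ")

  if (!is.na(not_pos[1])) {
    # first occurrence of the target word
    word_pos <- str_locate(tolower(phrase), word)

    if (!is.na(word_pos[1])) {
      # word after the not: end of not to start of word;
      # word before the not: end of word to start of not
      neg_dist <- ifelse(word_pos[1] > not_pos[1],
                         word_pos[1] - not_pos[2],
                         not_pos[1] - word_pos[2])

      if (neg_dist < distance_allowed) {
        phrase <- gsub(word, paste0("NOT_", word), phrase)
      }
    }
  }
  return(phrase)
}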

My humble logic is the following:

  1. Find the not in the phrase. It must start the phrase, sit between words, or end the phrase; this avoids matching words like nothing, since I am not so good with the pesky regular expressions.

  2. If the not is there, find the position of the word. If the word is found, calculate the distance between the two: if the word comes before the not, from the end of the word to the start of the not; otherwise, from the end of the not to the start of the word.

  3. If this distance is smaller than the one you allow (in your case n = 15), make the change.
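
To make the three steps concrete, here is a hedged base R trace on one of the question's phrases. It is a sketch under assumptions, not the function above: it only considers a word that comes after the negation (the case the question asks for), it looks at the first match only, and the negation pattern that also accepts "n't" is an illustrative choice.

```r
phrase <- "This isn't a very good way"
word <- "good"
n <- 15

# Step 1: locate the first standalone negation (assumed pattern)
neg <- regexpr("\\b(?:not|\\w*n't)\\b", tolower(phrase), perl = TRUE)
neg_start <- as.integer(neg)
neg_end <- neg_start + attr(neg, "match.length") - 1L

# Step 2: locate the first occurrence of the target word
word_start <- as.integer(regexpr(word, tolower(phrase), fixed = TRUE))

# Step 3: characters strictly between the negation and the word
gap <- word_start - neg_end - 1L

# Replace only when both were found and the word follows within n characters
if (neg_start != -1 && word_start != -1 && gap >= 0 && gap <= n) {
  phrase <- sub(word, paste0("NOT_", word), phrase, fixed = TRUE)
}
print(gap)
print(phrase)
```

Here "isn't" ends at character 10 and "good" starts at character 19, so the gap is 8 and the replacement is made.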

Please test it! Hope it helps

User2321
  • Thank you, unfortunately I could not get it to run with my primitive test: http://pastebin.com/6ExjHLEL – florian Oct 29 '16 at 07:56