
Let's assume I have the following sentence:


s = c("I don't want to remove punctuation for negations. Instead, I want to remove only general punctuation. For example, keep I wouldn't like it but remove Inter's fan or Man city's fan.")

I would like to have the following outcome:

"I don't want to remove punctuation for negations Instead I want to remove only general punctuation For example keep I wouldn't like it but remove Inter fan or Man city fan."

At the moment, if I simply use the code below, I remove both the 's and the ' in the negations.


  s %>%  str_replace_all("['']s\\b|[^[:alnum:][:blank:]@_]"," ")

 "I don t want to remove punctuation for negations  Instead  I want to remove only general punctuation           For example  keep I wouldn t like it but remove Inter  fan or Man city  fan "

To sum up, I need code that removes general punctuation, including "'s", but leaves negations in their raw form.

Can anyone help me?

Thanks!

zx8754
Rollo99
  • Negation is always(?) `"'t"`, maybe just remove `"'s"` with a fixed match? – zx8754 Sep 29 '21 at 08:32
  • The issue with that is that I still need to clear the general punctuation. Whichever cleaning strategy I used so far removed `"'t"` – Rollo99 Sep 29 '21 at 08:34
  • Then do it with 2 steps, [remove all punctuation excluding "'"](https://stackoverflow.com/a/8698368/680068), then remove "'s" using fixed match. – zx8754 Sep 29 '21 at 08:41
  • Why is the full stop "." not removed at the end? – zx8754 Sep 29 '21 at 08:43

2 Answers


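You can use a negative lookahead (?!t) to test that the [:punct:] character is not followed by a t. The trailing \\w? then also consumes one word character after the punctuation, which is what removes the s of 's.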

gsub("[[:punct:]](?!t)\\w?", "", s, perl=TRUE)
#[1] "I don't want to remove punctuation for negations Instead I want to remove only general punctuation For example keep I wouldn't like it but remove Inter fan or Man city fan"
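To see what this pattern does piece by piece, here is a minimal sketch on a few short strings. The test vector `x` is made up for illustration and is not from the question. The last element shows a side effect worth knowing about: because of `\\w?`, a word character directly after any removed punctuation is dropped as well.

```r
# Pattern from the answer above: a punctuation character that is not
# followed by "t", plus at most one trailing word character.
pat <- "[[:punct:]](?!t)\\w?"

# Hypothetical test strings, chosen to exercise each case.
x <- c("don't", "Inter's", "fan.", "a,b")

res <- gsub(pat, "", x, perl = TRUE)
res
# [1] "don't" "Inter" "fan"   "a"
```

If input such as "a,b" can occur and the "b" must survive, the variants without `\\w?` are probably the safer choice.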

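In case you want to be more strict, you can additionally require an n before the apostrophe, so that only n't is kept. A bare (?<!n) placed in front of [[:punct:]] would also keep any punctuation that follows an n (for example the full stop in "punctuation."), so the lookbehind goes inside the lookahead:

gsub("(?!(?<=n)'t)[[:punct:]]\\w?", "", s, perl=TRUE)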

Or, to restrict the exception to 't only (thanks to @chris-ruehlemann):

gsub("(?!'t)[[:punct:]]\\w?", "", s, perl=TRUE)

Or remove every punctuation character except ', and additionally remove 's:

gsub("[^'[:^punct:]]|'s", "", s, perl = TRUE)

The same, but using a lookahead:

gsub("(?!')[[:punct:]]|'s", "", s, perl = TRUE)
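As a quick sanity check, the sketch below runs four of the patterns from this answer on the sentence from the question and compares each result with the output shown above. The names in `patterns` and the variables `expected` and `results` are only labels chosen for this example.

```r
# Sentence from the question, split over three lines for readability.
s <- paste(
  "I don't want to remove punctuation for negations.",
  "Instead, I want to remove only general punctuation.",
  "For example, keep I wouldn't like it but remove Inter's fan or Man city's fan."
)

# Output reported in the answer (note: the final full stop is removed too).
expected <- paste(
  "I don't want to remove punctuation for negations",
  "Instead I want to remove only general punctuation",
  "For example keep I wouldn't like it but remove Inter fan or Man city fan"
)

# Labels are hypothetical; the patterns are copied from the answer.
patterns <- c(
  not_before_t  = "[[:punct:]](?!t)\\w?",
  keep_only_t   = "(?!'t)[[:punct:]]\\w?",
  negated_class = "[^'[:^punct:]]|'s",
  lookahead     = "(?!')[[:punct:]]|'s"
)

# Apply every pattern to s and collect the cleaned strings.
results <- vapply(
  patterns,
  function(p) gsub(p, "", s, perl = TRUE),
  character(1)
)

all(results == expected)
```

The patterns agree on this sentence, but they are not equivalent in general: the first two also drop a word character that follows removed punctuation, while the last two keep every apostrophe that is not part of 's.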
GKi

We can do it in two steps: remove all punctuation except "'", then remove "'s" using a fixed match:

gsub("'s", "", gsub("[^[:alnum:][:space:]']", "", s), fixed = TRUE)
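The nested call can be hard to read, so here is the same idea unrolled into two assignments, as a sketch. The intermediate names `step1` and `step2` are made up for this example. Note that `fixed = TRUE` applies only to the second call, where "'s" is matched literally; the first call still uses a regular expression.

```r
# Sentence from the question.
s <- paste(
  "I don't want to remove punctuation for negations.",
  "Instead, I want to remove only general punctuation.",
  "For example, keep I wouldn't like it but remove Inter's fan or Man city's fan."
)

# Step 1: drop everything that is not a letter, digit, whitespace, or "'".
step1 <- gsub("[^[:alnum:][:space:]']", "", s)

# Step 2: drop the literal two-character string "'s" wherever it occurs.
step2 <- gsub("'s", "", step1, fixed = TRUE)
step2
```

Because step 2 is a plain literal match, it also strips the 's in contractions such as "it's", which may or may not be what you want.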
zx8754