Removing hyphens in http but preserving hyphenated words in corpus

Question

I am trying to modify a stemming function so that it 1) removes hyphens that appear in http links in the corpus but 2) preserves hyphens in meaningful hyphenated expressions (e.g., time-consuming, cost-prohibitive). I asked a similar question a few months ago in a different thread; the code looks like this:

# load stringr to use str_replace_all
require(stringr)

clean.text = function(x)
{
  # remove rt
  x = gsub("rt ", "", x)
  # remove at
  x = gsub("@\\w+", "", x)
  x = gsub("[[:punct:]]", "", x)
  x = gsub("[[:digit:]]", "", x)
  # remove http
  x = gsub("http\\w+", "", x)
  x = gsub("[ |\t]{2,}", "", x)
  x = gsub("^ ", "", x)
  x = gsub(" $", "", x)
  x = str_replace_all(x, "[^[:alnum:][:space:]'-]", " ")
  return(x)
}

# example
my_text <- "accident-prone"
new_text <- clean.text(my_text)
new_text
[1] "accidentprone"

I did not get a satisfactory answer, so I shifted my attention to other projects until resuming work on this. It appears that the "[^[:alnum:][:space:]'-]" in the last line of the code block is the culprit that also removes - from the non-http part of the corpus.

I could not figure out how to get the desired output; I would greatly appreciate any insights on this.

Try replacing `str_replace_all(x, "[^[:alnum:][:space:]'-]", " ")` with `gsub("\\b-\\b(*SKIP)(*F)|[^[:alnum:][:space:]'-]", " ", x, perl=TRUE)` or - to keep the pattern Unicode aware - `gsub("(*UCP)\\b-\\b(*SKIP)(*F)|[^\\w\\s'-]|_", " ", x, perl=TRUE)` — Wiktor Stribiżew, Oct 05 '18 at 11:18
Ok, replace `x = gsub("[[:punct:]]", "", x)` with `x = gsub("(?!-)[[:punct:]]", "", x, perl=TRUE)`. Note that still you may get rid of `stringr` by replacing the `str_replace` line with `x = gsub("[^[:alnum:][:space:]'-]", " ", x)` — Wiktor Stribiżew, Oct 05 '18 at 11:45

score 1 · Accepted Answer · answered Oct 05 '18 at 11:55

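The actual culprit is the [[:punct:]] removal step: that class includes -, so it strips hyphens anywhere in the string. The bracket expression in the last line explicitly allows -, so it is not what removes them.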
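
A quick way to check this, as a minimal sketch using only base R (the variable names are illustrative and not from the question):

```r
x <- "accident-prone"

# [[:punct:]] includes "-", so the hyphen is already gone after this step
after_punct <- gsub("[[:punct:]]", "", x)

# the negated class lists "-" as allowed, so the hyphen survives this step
after_class <- gsub("[^[:alnum:][:space:]'-]", " ", x)

print(after_punct)  # "accidentprone"
print(after_class)  # "accident-prone"
```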

You may use

clean.text <- function(x)
{
  # remove rt
  x <- gsub("rt\\s", "", x)
  # remove at
  x <- gsub("@\\w+", "", x)
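  # remove punctuation, but skip hyphens between word characters
  x <- gsub("\\b-\\b(*SKIP)(*F)|[[:punct:]]", "", x, perl=TRUE)
  # remove digits
  x <- gsub("[[:digit:]]+", "", x)
  # remove what is left of http links (their punctuation is already gone)
  x <- gsub("http\\w+", "", x)
  # remove runs of 2 or more horizontal whitespace characters
  x <- gsub("\\h{2,}", "", x, perl=TRUE)
  # trim leading and trailing whitespace
  x <- trimws(x)
  # replace anything other than alphanumerics, whitespace, ' and - with a space
  x <- gsub("[^[:alnum:][:space:]'-]", " ", x)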
  return(x)
}

Then,

my_text <- "  accident-prone  http://www.some.com  rt "
new_text <- clean.text(my_text)
new_text 
## => [1] "accident-prone"

See the R demo.
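
One case the example above does not exercise is a link that itself contains a hyphen. Because \w does not match -, the http\\w+ step stops at the first hyphen that survived the punctuation step, so part of the link may be left behind. A possible workaround, sketched here under the assumption that links are whitespace-delimited and start with http:// or https://, is to drop whole links before touching punctuation. The function name clean_text_urls_first is hypothetical, and collapsing whitespace to a single space (rather than to nothing) is a choice made for this sketch, not taken from the answer.

```r
# hypothetical variant, not the answer's function: links are removed first
clean_text_urls_first <- function(x) {
  # drop whole links (http or https) up to the next whitespace
  x <- gsub("https?://\\S+", "", x, perl = TRUE)
  # remove punctuation except hyphens sitting between word characters
  x <- gsub("\\b-\\b(*SKIP)(*F)|[[:punct:]]", "", x, perl = TRUE)
  # collapse runs of horizontal whitespace to one space, then trim
  x <- gsub("\\h+", " ", x, perl = TRUE)
  trimws(x)
}

res <- clean_text_urls_first("time-consuming task, see http://my-site.com/a-b now!")
print(res)  # "time-consuming task see now"
```

The rt, @mention and digit steps are left out to keep the sketch short; they could be added back in the same order as in the answer.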

Note:

x = gsub("^ ", "", x) and x = gsub(" $", "", x) can be replaced with trimws(x).
gsub("\\b-\\b(*SKIP)(*F)|[[:punct:]]", "", x, perl=TRUE) removes any punctuation except hyphens between word characters (you may adjust this further in the part before (*SKIP)(*F)).
gsub("[^[:alnum:][:space:]'-]", " ", x) is a base R equivalent of str_replace_all(x, "[^[:alnum:][:space:]'-]", " ").
gsub("\\h{2,}", "", x, perl=TRUE) removes any run of 2 or more horizontal whitespace characters. If by "[ |\t]{2,}" you meant to match any 2 or more whitespace characters, use \\s instead of \\h here (note that | inside a bracket expression is a literal pipe, not alternation).
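
The two PCRE-specific notes can be tried in isolation. This is a small sketch with made-up test strings; the replacement " " in the whitespace part is an assumption chosen so that words stay separated, whereas the function above replaces with "".

```r
# the punctuation pattern from the answer
pat <- "\\b-\\b(*SKIP)(*F)|[[:punct:]]"

# only hyphens with a word character on both sides are kept
kept <- gsub(pat, "", "-cost-prohibitive-, a-b!", perl = TRUE)
print(kept)  # "cost-prohibitive a-b"

# \h matches horizontal whitespace only; \s also matches newlines
s <- "a \t b\n\nc"
only_h <- gsub("\\h{2,}", " ", s, perl = TRUE)  # newlines are left alone
any_ws <- gsub("\\s{2,}", " ", s, perl = TRUE)  # newlines are collapsed too
print(any_ws)  # "a b c"
```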

Thank you so much for this very detailed explanation! I will read more on `gsub`. — Chris T., Oct 05 '18 at 18:33
