I am trying to modify a text-cleaning function so that it 1) removes the hyphens in http links that appear in the corpus while 2) preserving the hyphens in meaningful hyphenated expressions (e.g., time-consuming, cost-prohibitive). I asked a similar question a few months ago in a different thread; the code looks like this:

# load stringr to use str_replace_all
require(stringr)

clean.text = function(x)
{
  # remove rt
  x = gsub("rt ", "", x)
  # remove at
  x = gsub("@\\w+", "", x)
  x = gsub("[[:punct:]]", "", x)
  x = gsub("[[:digit:]]", "", x)
  # remove http
  x = gsub("http\\w+", "", x)
  x = gsub("[ |\t]{2,}", "", x)
  x = gsub("^ ", "", x)
  x = gsub(" $", "", x)
  x = str_replace_all(x, "[^[:alnum:][:space:]'-]", " ")
  #return(x)
}

# example
my_text <- "accident-prone"
new_text <- clean.text(my_text)
new_text
[1] "accidentprone"

However, I did not get a satisfactory answer, and I shifted my attention to other projects until resuming work on this one. It appears that the "[^[:alnum:][:space:]'-]" in the last line of the code block is the culprit that also removes - from the non-http part of the corpus.

I could not figure out how to get the desired output, and I would appreciate any insights on this.

Chris T.
  • Try replacing `str_replace_all(x, "[^[:alnum:][:space:]'-]", " ")` with `gsub("\\b-\\b(*SKIP)(*F)|[^[:alnum:][:space:]'-]", " ", x, perl=TRUE)` or - to keep the pattern Unicode aware - `gsub("(*UCP)\\b-\\b(*SKIP)(*F)|[^\\w\\s'-]|_", " ", x, perl=TRUE)` – Wiktor Stribiżew Oct 05 '18 at 11:18
  • It still gives the same result. – Chris T. Oct 05 '18 at 11:25
  • Ok, replace `x = gsub("[[:punct:]]", "", x)` with `x = gsub("(?!-)[[:punct:]]", "", x, perl=TRUE)`. Note that still you may get rid of `stringr` by replacing the `str_replace` line with `x = gsub("[^[:alnum:][:space:]'-]", " ", x)` – Wiktor Stribiżew Oct 05 '18 at 11:45

1 Answer

The actual culprit is the `[[:punct:]]` removal pattern, since it matches `-` anywhere in the string.
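
To see this in isolation, here is a minimal sketch that applies the two patterns from the question to the sample string on their own (the variable names are only illustrative):

```r
x <- "accident-prone"

# [[:punct:]] includes the hyphen, so this step strips it
no_punct <- gsub("[[:punct:]]", "", x)

# the final pattern lists - inside the negated class, so it leaves the hyphen alone
last_step <- gsub("[^[:alnum:][:space:]'-]", " ", x)

print(no_punct)
print(last_step)
```

The first call returns the string without the hyphen, while the second returns it unchanged, so by the time the last line of the function runs there is no hyphen left to preserve.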

You may use

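clean.text <- function(x)
{
  # remove "rt" followed by a whitespace character
  x <- gsub("rt\\s", "", x)
  # remove @mentions
  x <- gsub("@\\w+", "", x)
  # remove punctuation, but skip hyphens that sit between word characters
  x <- gsub("\\b-\\b(*SKIP)(*F)|[[:punct:]]", "", x, perl=TRUE)
  # remove digits
  x <- gsub("[[:digit:]]+", "", x)
  # remove what is left of URLs ("http" plus the following word characters)
  x <- gsub("http\\w+", "", x)
  # remove runs of 2 or more horizontal whitespace characters
  x <- gsub("\\h{2,}", "", x, perl=TRUE)
  # trim leading and trailing whitespace
  x <- trimws(x)
  # replace anything other than letters, digits, whitespace, ' and - with a space
  x <- gsub("[^[:alnum:][:space:]'-]", " ", x)
  return(x)
}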

Then,

my_text <- "  accident-prone  http://www.some.com  rt "
new_text <- clean.text(my_text)
new_text 
## => [1] "accident-prone"

See the R demo.

Note:

  • `x = gsub("^ ", "", x)` and `x = gsub(" $", "", x)` can be replaced with `trimws(x)`.
  • `gsub("\\b-\\b(*SKIP)(*F)|[[:punct:]]", "", x, perl=TRUE)` removes any punctuation except hyphens between word characters (you may adjust this further in the part before `(*SKIP)(*F)`).
  • `gsub("[^[:alnum:][:space:]'-]", " ", x)` is a base R equivalent of `str_replace_all(x, "[^[:alnum:][:space:]'-]", " ")`.
  • `gsub("\\h{2,}", "", x, perl=TRUE)` removes any run of 2 or more horizontal whitespace characters. If by `"[ |\t]{2,}"` you meant to match any 2 or more whitespace characters, use `\\s` instead of `\\h` here.
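
The differences between the patterns discussed here can be checked on small made-up strings. This is only a sketch: the sample inputs and variable names are assumptions for illustration and do not come from the original corpus.

```r
s <- "-lead time-consuming, cost-prohibitive! trail-"

# (*SKIP)(*F): hyphens between word characters are skipped,
# all other punctuation (including stray hyphens) is removed
keep_inner <- gsub("\\b-\\b(*SKIP)(*F)|[[:punct:]]", "", s, perl = TRUE)

# negative lookahead suggested in the comments: every hyphen is kept
keep_all <- gsub("(?!-)[[:punct:]]", "", s, perl = TRUE)

# replacing a whitespace run with "" joins the neighbouring words,
# replacing it with " " keeps them apart
joined <- gsub("\\h{2,}", "", "time-consuming  work", perl = TRUE)
spaced <- gsub("\\h{2,}", " ", "time-consuming  work", perl = TRUE)

# \\h does not match line breaks, \\s does
h_only <- gsub("\\h{2,}", " ", "a\n\nb", perl = TRUE)
s_any  <- gsub("\\s{2,}", " ", "a\n\nb", perl = TRUE)

print(c(keep_inner, keep_all, joined, spaced))
```

If words in the corpus can be separated by more than one space, using `" "` as the replacement in the whitespace step may be the safer choice, since an empty replacement glues the words together.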
Wiktor Stribiżew