I am trying to modify a stemming function so that it 1) removes hyphens in http (that appear in the corpus) while 2) preserving hyphens in meaningful hyphenated expressions (e.g., time-consuming, cost-prohibitive). I asked a similar question a few months ago in a different thread. The code looks like this:
# load stringr to use str_replace_all
require(stringr)
clean.text = function(x)
{
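# remove retweet markers ("rt ")
x = gsub("rt ", "", x)
# remove @mentions
x = gsub("@\\w+", "", x)
# remove all punctuation
x = gsub("[[:punct:]]", "", x)
# remove digits
x = gsub("[[:digit:]]", "", x)
# remove http
x = gsub("http\\w+", "", x)
# remove runs of two or more spaces, pipes, or tabs
x = gsub("[ |\t]{2,}", "", x)
# trim one leading and one trailing space
x = gsub("^ ", "", x)
x = gsub(" $", "", x)
# replace anything other than alphanumerics, whitespace, ' and - with a space
x = str_replace_all(x, "[^[:alnum:][:space:]'-]", " ")
return(x)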
}
# example
my_text <- "accident-prone"
new_text <- clean.text(my_text)
new_text
[1] "accidentprone"
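I did not get a satisfactory answer at the time, and I shifted my attention to other projects until resuming work on this. I initially suspected that the "[^[:alnum:][:space:]'-]" pattern in the last line of the code block was the culprit that also removed - from the non-http part of the corpus. However, that character class is written to keep ' and -, so the hyphen may instead be lost at the earlier "[[:punct:]]" step.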
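To check this, each pattern from clean.text can be applied to the example on its own to see which one strips the hyphen. This is only a minimal diagnostic sketch: the `steps` vector and its labels are hypothetical names introduced here, and base `gsub` stands in for `str_replace_all` on the last pattern, which is an assumption that both treat the character class the same way.

```r
# Example input from the question
my_text <- "accident-prone"

# Patterns copied from clean.text; the labels are hypothetical
steps <- c(
  rt      = "rt ",
  mention = "@\\w+",
  punct   = "[[:punct:]]",
  digit   = "[[:digit:]]",
  http    = "http\\w+",
  spaces  = "[ |\t]{2,}",
  last    = "[^[:alnum:][:space:]'-]"
)

# TRUE where a pattern, applied on its own, leaves no hyphen in the input
strips_hyphen <- vapply(
  steps,
  function(p) !grepl("-", gsub(p, "", my_text), fixed = TRUE),
  logical(1)
)
strips_hyphen
```

On this input, only the `punct` entry should come back TRUE, since `[[:punct:]]` includes the hyphen.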
I could not figure out how to achieve the desired output. I would greatly appreciate it if someone could offer their insights on this.
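For concreteness, here is a rough sketch of the behavior described above, not a verified fix. It assumes that URLs should be removed before any punctuation is stripped, and that only hyphens sitting between two letters count as meaningful. The function name `clean_text_sketch` is hypothetical, and the sketch uses base R only.

```r
# Hypothetical helper illustrating the target behavior
clean_text_sketch <- function(x) {
  # remove URLs first, while their punctuation (including hyphens) is intact
  x <- gsub("http[^[:space:]]*", "", x)
  # remove standalone "rt" markers and @mentions
  x <- gsub("\\brt\\b", "", x, perl = TRUE)
  x <- gsub("@\\w+", "", x)
  # replace everything except letters, whitespace, ' and - with a space
  # (this also drops digits)
  x <- gsub("[^[:alpha:][:space:]'-]", " ", x)
  # replace hyphens that are not between two letters with a space
  x <- gsub("(?<![[:alpha:]])-|-(?![[:alpha:]])", " ", x, perl = TRUE)
  # collapse runs of whitespace and trim both ends
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

clean_text_sketch("accident-prone")
clean_text_sketch("see http://t.co/ab-cd now - ok")
```

The ordering is the main design choice here: once punctuation has been stripped, a URL can no longer be told apart from ordinary words, so the URL step has to run first.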