
Is there an effective way to remove punctuation from text while keeping hyphenated expressions, such as "accident-prone"?

I used the following function to clean my text:

clean.text = function(x)
{
  # remove rt
  x = gsub("rt ", "", x)
  # remove at
  x = gsub("@\\w+", "", x)
  x = gsub("[[:punct:]]", "", x)
  x = gsub("[[:digit:]]", "", x)
  # remove http
  x = gsub("http\\w+", "", x)
  x = gsub("[ |\t]{2,}", "", x)
  x = gsub("^ ", "", x)
  x = gsub(" $", "", x)
  x = str_replace_all(x, "[^[:alnum:][:space:]'-]", " ")
  #return(x)
}

and applied it to a hyphenated expression, which returned:

my_text <- "accident-prone"
new_text <- clean.text(my_text)
new_text
[1] "accidentprone"

while my desired output is

"accident-prone"

I have consulted this thread, but it didn't work for my situation. There must be some regex detail that I haven't figured out. I would really appreciate it if someone could enlighten me on this.

MrFlick
Chris T.
  • When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. It would be better to give a longer list of test cases. There are plenty of other functions you can run that won't remove that dash, but we need to better know what you do want to get rid of. Does this work for any two word structure "dolphin-become"? – MrFlick Mar 05 '18 at 16:37
  • I believe my question above includes 1) reproducible example, 2) sample input, and 3) desired output as well as pointing out why my code may need to be modified (hence it did not work on "dolphin-become" as you might expect, it simply generated "dolphinbecome"). – Chris T. Mar 05 '18 at 16:42
  • Well, `clean.text <- identity` would "work" on the example case you provided (it works by doing nothing). The point is that it's very easy to not remove dashes, just don't remove dashes. We don't know what you actually want to remove. – MrFlick Mar 05 '18 at 16:44
  • I didn't get your point, `clean.text` is a user-defined function. – Chris T. Mar 05 '18 at 16:46
  • In r regex patterns with tabs as a target you need to double the backslash. Furthermore, I'm pretty sure the [[:punct:]] is removing your dashes. – IRTFM Mar 05 '18 at 16:48
  • @42 Thanks a bunch! It works out well after I modified `gsub("[[:punct:]]", "", x)` to `gsub("\\[[:punct:]]", "", x)`. – Chris T. Mar 05 '18 at 16:52
  • They were two separate points. First point was referring to this pattern: `"[ |\t]{2,}"` but maybe I was wrong since it was inside a character class. I don't think your suggested correction works – IRTFM Mar 05 '18 at 17:04
  • Do you mean I should "split" them into two words (tokens)? Like "accident" and "prone"? – Chris T. Mar 05 '18 at 17:07
  • See my answer regarding the dashes – IRTFM Mar 05 '18 at 17:08

2 Answers

The [:punct:] character class includes the dash, so you are removing it. You could build an alternative character class that omits the dash. You do need to pay special attention to the placement of the square brackets, and to escape the double quote and the backslash:

(test <- gsub("[]!\"#$%&'()*+,./:;<=>?@[\\^_`{|}~]", "", "my-test of #$%^&*"))
[1] "my-test of "

The ?regex (help page) advises against using ranges. I investigated whether there might be any simplification using my local ASCII sequence of punctuation, but it quickly became obvious that was not the way to go for other reasons. There were 5 separate ranges, and the "]" was in the middle of one of them so there would have been 7 ranges to handle in addition to the "]" which needs to come first.
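
As an alternative sketch (not part of the answer above), a negated bracket expression avoids listing every punctuation mark: it removes anything that is not alphanumeric, whitespace, or an ASCII hyphen. Note that this is slightly broader than removing punctuation only, since other symbols and control characters are dropped too. The second call assumes `perl = TRUE` is acceptable and uses a negative lookahead to exclude the hyphen from `[[:punct:]]`. The variable names are illustrative.

```r
x <- c("my-test of #$%^&*", "accident-prone")

# Keep letters, digits, whitespace and the ASCII hyphen; drop everything else.
# The hyphen is placed last in the class so it is read literally.
keep_hyphen <- gsub("[^[:alnum:][:space:]-]", "", x)

# PCRE variant: match a punctuation character only if it is not a hyphen.
keep_hyphen_pcre <- gsub("(?!-)[[:punct:]]", "", x, perl = TRUE)

print(keep_hyphen)
print(keep_hyphen_pcre)
```

Both calls leave the trailing space in the first string, so a final `trimws()` may still be wanted.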

IRTFM
  • Your suggested edits and my previous code work on the example cases, but when I applied the function to a large batch of text (cut and pasted from the web and saved in CSV format), some phrases kept their hyphenated forms and some didn't. For example, "accident-prone" lost its hyphen after cleaning (but worked fine when entered manually). I am thinking this might be an encoding issue, but that's beyond me. – Chris T. Mar 05 '18 at 17:13
  • There is more than one form of dash. You need to post examples of cases where you got failure. – IRTFM Mar 05 '18 at 17:27
  • Ok, I'll edit my original question and post it later to include a dropbox link for my working dataset. I think the reason may be because I used RTextTools's `create_matrix()` function as a shortcut to create document-term-matrix, the resulting `dtm` output contains terms that are not correctly stemmed as desired. – Chris T. Mar 05 '18 at 17:58
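
One possible reading of the "more than one form of dash" comment, sketched under the assumption that the pasted web text contains Unicode dashes (for example an en dash) rather than the ASCII hyphen. The thread never confirms this for the actual data set. `normalize_dashes` is a hypothetical helper, and the list of code points is a guess at the common look-alikes, not an exhaustive set.

```r
# Hypothetical helper: map common Unicode dash look-alikes to the ASCII hyphen.
normalize_dashes <- function(x) {
  # U+2010 hyphen, U+2011 non-breaking hyphen, U+2012 figure dash,
  # U+2013 en dash, U+2014 em dash, U+2212 minus sign
  gsub("[\u2010\u2011\u2012\u2013\u2014\u2212]", "-", x)
}

pasted <- "accident\u2013prone"  # en dash, as might come from web text

# Normalize first, then strip everything except alphanumerics, whitespace and "-".
cleaned <- gsub("[^[:alnum:][:space:]-]", "", normalize_dashes(pasted))
print(cleaned)
```

Without the normalization step, the en dash is not an ASCII hyphen, so a hyphen-preserving pattern would be expected to strip it, which would match the symptom described in the comments.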

Putting my two cents in, you could use (*SKIP)(*FAIL) with perl = TRUE and remove any non-word characters:

data <- c("my-test of #$%^&*", "accident-prone")
(gsub("(?<![^\\w])[- ](?=\\w)(*SKIP)(*FAIL)|\\W+", "", data, perl = TRUE))

Resulting in

[1] "my-test of"     "accident-prone"

See a demo on regex101.com.


Here the idea is to match what you want to keep
(?<![^\\w])[- ](?=\\w)
# a whitespace or a dash between two word characters
# or at the very beginning of the string

let these fail with (*SKIP)(*FAIL) and put what you want to be removed on the right side of the alternation, in this case

\W+

effectively removing any non-word-characters not between word characters.
You'd need to provide more examples for testing though.
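
As one extra test case (an illustration, not from the original answer): when punctuation is followed by a space, the `\W+` branch consumes both, so replacing with an empty string joins the neighbouring words. A small variation, offered only as a sketch, replaces each removed run with a single space and trims the result. The variable names are illustrative.

```r
pattern <- "(?<![^\\w])[- ](?=\\w)(*SKIP)(*FAIL)|\\W+"
x <- "well-known, accident-prone!"

# Replacing with "" also swallows the space after the comma.
glued <- gsub(pattern, "", x, perl = TRUE)

# Replacing with a single space and trimming keeps the words apart.
spaced <- trimws(gsub(pattern, " ", x, perl = TRUE))

print(glued)   # "well-knownaccident-prone"
print(spaced)  # "well-known accident-prone"
```

Which behaviour is preferable depends on what the cleaned text feeds into, so this is a design choice rather than a fix.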

Jan
  • Thanks for weighing in. A quick way to handle hyphenated words situation is to specify `preserve_intra_word_dashes = TRUE` in `tm`'s `removePunctuation()` function. This control, however, will be lost once I transform the stemmed corpus into document-term-matrix using most of the text mining functions, and I couldn't figure out why. – Chris T. Mar 05 '18 at 20:36