4

I am using R to tokenize a set of texts. After tokenization I end up with a character vector in which punctuation marks, apostrophes and hyphens are preserved. For instance, I have this original text:

txt <- "this ain't a Hewlett-Packard box - it's an Apple box, a very nice one!"

After the tokenization (which I perform using scan_tokenizer from package tm) I get the following char vector

   > vec1
 [1] "this"            "ain't"           "a"               "Hewlett-Packard"
 [5] "box"             "-"               "it's"            "an"             
 [9] "Apple"           "box,"            "a"               "very"           
[13] "nice"            "one!"           

Now in order to get rid of the punctuation marks I do the following

vec2 <- gsub("[^[:alnum:][:space:]']", "", vec1)

That is, I replace everything that is not an alphanumeric character, a space or an apostrophe with ""; however, this is the result:

> vec2
 [1] "this"           "ain't"          "a"              "HewlettPackard" "box"           
 [6] ""               "it's"           "an"             "Apple"          "box"           
[11] "a"              "very"           "nice"           "one"    

I want to preserve hyphenated words such as "Hewlett-Packard" while getting rid of lone hyphens. Basically, I need a regex that excludes hyphenated words of the form \w-\w from the gsub expression for vec2.

Any suggestions are welcome.

6 Answers

5

If you just want to remove "pure hyphens", use the pattern '^-$' (the hyphen is not a regex metacharacter outside a character class):

vec2 <- vec1[!grepl( '^-$' , vec1) ]

If you wanted to remove "naked punctuation of all sorts" it might be:

vec2 <- vec1[!grepl( '^[[:punct:]]$' , vec1) ]
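
A minimal sketch of both filters applied to the vector from the question. The names no_hyphens, no_naked_punct and cleaned are illustrative only. The last step is an assumption about what the asker ultimately wants: it reuses the gsub from the question with a hyphen added to the set of kept characters, so trailing punctuation is stripped while in-word hyphens survive.

```r
# Token vector as shown in the question
vec1 <- c("this", "ain't", "a", "Hewlett-Packard", "box", "-", "it's",
          "an", "Apple", "box,", "a", "very", "nice", "one!")

# Drop tokens that consist of exactly one hyphen
no_hyphens <- vec1[!grepl("^-$", vec1)]

# Drop tokens that consist of exactly one punctuation character of any kind
no_naked_punct <- vec1[!grepl("^[[:punct:]]$", vec1)]

# Neither filter touches punctuation attached to a word ("box,", "one!").
# To strip that as well, keep alphanumerics, spaces, apostrophes and hyphens;
# a hyphen placed last inside the bracket expression is a literal hyphen.
cleaned <- gsub("[^[:alnum:][:space:]'-]", "", no_hyphens)
print(cleaned)
```

Filtering first and substituting second avoids the empty string that appears when the lone hyphen is deleted by gsub alone.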
IRTFM
3
strsplit(gsub('[[:punct:]](?!\\w)', '', txt, perl = TRUE), ' ')[[1]]
 #[1] "this"            "ain't"           "a"              
 #[4] "Hewlett-Packard" "box"             ""               
 #[7] "it's"            "an"              "Apple"          
#[10] "box"             "a"               "very"           
#[13] "nice"            "one"

Or you can do this to keep the exclamation point after "one":

strsplit(gsub('(?<!\\w)[[:punct:]](?!\\w)', '', txt, perl = TRUE), ' ')[[1]]
#  [1] "this"            "ain't"           "a"              
#  [4] "Hewlett-Packard" "box"             ""               
#  [7] "it's"            "an"              "Apple"          
# [10] "box,"            "a"               "very"           
# [13] "nice"            "one!"

Both patterns use regex lookarounds, which require the PCRE engine (the perl argument). In the first pattern, (?!\\w) is a negative lookahead: a punctuation mark is removed unless it is immediately followed by a word character (a letter, digit or underscore). In the second pattern, (?<!\\w) is a negative lookbehind; combined with the lookahead, a punctuation mark is removed only if it is neither preceded nor followed by a word character, which is why "box," and "one!" survive. To remember the difference: a lookbehind looks back at what comes before the current position, and a lookahead looks forward at what comes next.
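
As a small illustration of how the two lookarounds differ, here is a sketch on a shorter string. The string x and the result names are made up for this example.

```r
x <- "a-b - c, d!"

# Negative lookahead only: remove punctuation that is NOT followed by a word
# character. The hyphen in "a-b" is followed by "b", so it stays.
ahead_only <- gsub("[[:punct:]](?!\\w)", "", x, perl = TRUE)

# Negative lookbehind plus negative lookahead: remove punctuation only when
# there is no word character on either side, i.e. just the lone hyphen.
both_sides <- gsub("(?<!\\w)[[:punct:]](?!\\w)", "", x, perl = TRUE)

print(ahead_only)  # "a-b  c d"   (two spaces where the lone hyphen was)
print(both_sides)  # "a-b  c, d!"
```

The doubled space left behind by the lone hyphen is what produces the empty string when the result is split on single spaces.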

Pierre L
2
strsplit(gsub("[^[:alnum:][:space:]'-]", "", txt), '\\s|\\ - ')
Shenglin Chen
2

You may try this,

> library(stringr)    
> txt <- "this ain't a Hewlett-Packard box - it's an Apple box, a very nice one!"
> gsub("(?!\\b['-]\\b|\\s)[\\W_]", "", str_extract_all(txt, "\\S+")[[1]], perl = TRUE)
 [1] "this"            "ain't"           "a"              
 [4] "Hewlett-Packard" "box"             ""               
 [7] "it's"            "an"              "Apple"          
[10] "box"             "a"               "very"           
[13] "nice"            "one"  

or

> strsplit(gsub('(?!\\b[[:punct:]]\\b|\\s)[\\W_]', '', txt, perl = TRUE), ' ')[[1]]
 [1] "this"            "ain't"           "a"              
 [4] "Hewlett-Packard" "box"             ""               
 [7] "it's"            "an"              "Apple"          
[10] "box"             "a"               "very"           
[13] "nice"            "one" 
Avinash Raj
2

Here's an approach using strsplit with word boundaries (\b) and non-word characters (\W, which is equivalent to [^[:alnum:]_]):

strsplit(txt, "\\b | \\b|\\W |\\W$")
#[[1]]
# [1] "this"            "ain't"           "a"               "Hewlett-Packard"
# [5] "box"             ""                "it's"            "an"             
# [9] "Apple"           "box"             "a"               "very"           
#[13] "nice"            "one"            

Or, to return nothing at all for the lone hyphen instead of "":

strsplit(txt, "\\b | \\b| ?\\W |\\W$")
#[[1]]
# [1] "this"            "ain't"           "a"               "Hewlett-Packard"
# [5] "box"             "it's"            "an"              "Apple"          
# [9] "box"             "a"               "very"            "nice"
#[13] "one"
Jota
1

I suggest two principles: first, keep it as simple as possible; second, use Unicode character classes whenever possible, especially for things like hyphens, which various text processors may replace with other characters (see for instance http://www.fileformat.info/info/unicode/category/Pd/list.htm).

So:

Simplest (and also very fast), an exact match that detects only plain ASCII hyphens:

vec1[!(vec1 %in% "-")]

Better (from a Unicode standpoint), also pretty fast:

vec1[!stringi::stri_detect_regex(vec1, "^\\p{Pd}$")]

The last one uses the Unicode character class Pd, representing "a dash or hyphen punctuation mark". This includes non-breaking hyphens, em dashes, etc. The ^ and $ at the beginning and end of the regular expression mean that only tokens consisting of a single such character are matched.
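
For readers without stringi, the same idea can be sketched in base R with perl = TRUE. This is a variant, not the call used above, and it assumes R's PCRE library was built with Unicode property support (most builds are). The tokens vector is made up for illustration.

```r
# Hypothetical tokens: ASCII hyphen, en dash (U+2013), em dash (U+2014),
# plus ordinary words that must be kept
tokens <- c("box", "-", "\u2013", "\u2014", "Hewlett-Packard", "one!")

# TRUE for tokens that are exactly one character from Unicode category Pd
is_lone_dash <- grepl("^\\p{Pd}$", tokens, perl = TRUE)
kept <- tokens[!is_lone_dash]

# The exact-match approach only catches the ASCII hyphen
ascii_only <- tokens[!(tokens %in% "-")]

print(kept)
print(length(ascii_only))
```

The exact match leaves the en dash and em dash in place, which is the case the Unicode class is meant to cover.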

Ken Benoit