I already saw this one, but it is not quite what I need:


Situation: Using gsub, I want to clean up strings. These are my conditions:

  1. Keep words only (no digits or "weird" symbols)
  2. Keep words that are joined by exactly one of ' - _ $ . as a single word. For example: don't, re-loading, come_home, something$col
  3. Keep specific names, such as package::function or package::function()

So, I have the following:

  1. [^A-Za-z]
  2. ([a-z]+)(-|'|_|$)([a-z]+)
  3. ([a-z]+(_*)[a-z]+)(::)([a-z]+(_*)[a-z]+)(\(\))*

Examples:

If I have the following:

# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't
# Needs to handle NA for desc::desc_get()
# Update href of toc anchors , use "-" instead "."
# Keep something$col or here_you::must_stay

I would like to have

Re-loading pkgdown while it's running causes weird behaviour with the context cache don't
Needs to handle NA for desc::desc_get()
Update href of toc anchors use instead
Keep something$col or here_you::must_stay

Problems: I have several:

A. The second expression is not working properly. Right now, it only works with - or '

B. How do I combine all of these in a single gsub in R? I want to do something like gsub(myPatterns, myText), but don't know how to fix and combine all of this.

Carrol
  • Try `trimws(gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE))`. See [the regex demo](https://regex101.com/r/UEghUj/1). – Wiktor Stribiżew Nov 17 '20 at 20:28
  • That works like a charm! Can you please put it as an answer? – Carrol Nov 17 '20 at 20:34

2 Answers

You can use

trimws(gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE))

See the regex demo. Or, to also collapse runs of two or more whitespace characters into a single space, use

trimws(gsub("\\s{2,}", " ", gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE)))
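If the cleanup is needed in more than one place, the two steps above could be wrapped in a small helper. This is only a sketch: the function name `clean_comments` and the sample input are made up for illustration, and the patterns are the ones from this answer, unchanged.

```r
# Hypothetical helper (the name is an assumption, not part of the original answer)
clean_comments <- function(x) {
  # Tokens to keep are matched first and skipped; any other char that is
  # neither a letter nor whitespace is removed
  keep_or_drop <- "(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]"
  stripped <- gsub(keep_or_drop, "", x, perl = TRUE)
  # Collapse the gaps left by removed chars, then trim both ends
  trimws(gsub("\\s{2,}", " ", stripped))
}

clean_comments("# call foo::bar() 42 times, re-loading")
# [1] "call foo::bar() times re-loading"
```

Keeping the pattern in a named variable also makes it easier to extend the joiner set `[-'_$]` later without touching the rest of the call.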

Details

  • (?:\w+::\w+(?:\(\))?|\p{L}+(?:[-'_$]\p{L}+)*)(*SKIP)(*F): match either of the two patterns, then discard the match:
    • \w+::\w+(?:\(\))? - 1+ word chars, ::, 1+ word chars and an optional () substring
    • | - or
    • \p{L}+ - one or more Unicode letters
    • (?:[-'_$]\p{L}+)* - 0+ repetitions of -, ', _ or $ followed by 1+ Unicode letters
  • (*SKIP)(*F) - fails the current match and makes the regex engine resume searching at the position where that match ended, so the matched text is left untouched
  • | - or
  • [^\p{L}\s] - any char other than a Unicode letter or whitespace
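To see the `(*SKIP)(*F)` idiom in isolation, here is a reduced sketch with a made-up input string and a simplified ASCII-only pattern (both are assumptions for illustration). Hyphens inside a word are protected by the first alternative; every other hyphen is removed by the second.

```r
x <- "a-b - c"

# First alternative: a hyphenated word, matched and then skipped
# Second alternative: any hyphen that was not skipped over
res <- gsub("[a-z]+-[a-z]+(*SKIP)(*F)|-", "", x, perl = TRUE)
res
# [1] "a-b  c"
```

The standalone hyphen is gone, which leaves two spaces behind. That is why the full solution optionally collapses whitespace afterwards.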

See the R demo:

myText <- c("# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't",
"# Needs to handle NA for desc::desc_get()",
'# Update href of toc anchors , use "-" instead "."',
"# Keep something$col or here_you::must_stay")
trimws(gsub("\\s{2,}", " ", gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE)))

Output:

[1] "Re-loading pkgdown while it's running causes weird behaviour with the context cache don't"
[2] "Needs to handle NA for desc::desc_get()"
[3] "Update href of toc anchors use instead"
[4] "Keep something$col or here_you::must_stay"
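
A note on problem A from the question: in the alternation `(-|'|_|$)` the unescaped `$` is an end-of-string anchor, not a literal dollar sign, so it can never be followed by more letters. Inside a character class, as in `[-'_$]` above, `$` is literal. The short check below illustrates this; the test string and variable names are chosen only for illustration.

```r
x <- "something$col"

# Unescaped `$` in an alternation acts as an anchor: no match
as_anchor <- grepl("[a-z]+(-|'|_|$)[a-z]+", x, perl = TRUE)

# `$` inside a character class is a literal dollar sign: match
in_class <- grepl("[a-z]+[-'_$][a-z]+", x, perl = TRUE)

# An escaped `$` in the alternation is also literal: match
escaped <- grepl("[a-z]+(-|'|_|\\$)[a-z]+", x, perl = TRUE)

c(as_anchor, in_class, escaped)
# [1] FALSE  TRUE  TRUE
```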
Wiktor Stribiżew

Alternatively,

txt <- c("# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't", 
         "# Needs to handle NA for desc::desc_get()",
         "# Update href of toc anchors , use \"-\" instead \".\"", 
         "# Keep something$col or here_you::must_stay")
expect <- c("Re-loading pkgdown while it's running causes weird behaviour with the context cache don't",
            "Needs to handle NA for desc::desc_get()",
            "Update href of toc anchors use instead",
            "Keep something$col or here_you::must_stay")

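# Remember which strings originally started with a space
leadspace <- grepl("^ ", txt)
# Match each run to keep: an optional leading space, letters, an optional
# :: or one of - ' _ $ ., more letters, and an optional trailing ()
gre <- gregexpr("\\b(\\s?[[:alpha:]]*(::|[-'_$.])?[[:alpha:]]*(\\(\\))?)\\b", txt)
# Blank out everything that was not matched
regmatches(txt, gre, invert = TRUE) <- ""
# Drop the leading space that is left behind once the "# " prefix is gone
txt[!leadspace] <- gsub("^ ", "", txt[!leadspace])
identical(expect, txt)
# [1] TRUE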
r2evans