
I have a character vector chars:

chars <- c("check24  smavey  dr klein", "smava", "check24, interhyp", 
  "verivox  check24  dr. klein", "dr. klein", NA, "dr. weber", 
  "dr. klein,", NA, "check24  verivox")

The goal is to insert "_" between two words that are separated by whitespace, provided the following conditions hold:

  1. There is no comma between the words (e.g. Name1, Name2 Name3 should become Name1, Name2_Name3).
  2. There is no period between them (e.g. Dr. Name1 Name2 Name3 should become Dr. Name1_Name2_Name3).
  3. The character sequences on both sides of the whitespace are at least 4 characters long (e.g. AAA AAAA AAAA AAAA should become AAA AAAA_AAAA_AAAA).
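To make the three rules concrete, here is a minimal sketch that stores the examples above as input/expected pairs. The names expected and check_rules are hypothetical helpers introduced only for illustration; any candidate function can be passed in and compared against the expected results.

```r
# Input strings (names) mapped to the results the three rules should give.
# `expected` and `check_rules` are hypothetical names used only for this sketch.
expected <- c(
  "Name1, Name2 Name3"    = "Name1, Name2_Name3",
  "Dr. Name1 Name2 Name3" = "Dr. Name1_Name2_Name3",
  "AAA AAAA AAAA AAAA"    = "AAA AAAA_AAAA_AAAA"
)

# Returns TRUE if the candidate function `f` reproduces every expected result.
check_rules <- function(f) {
  identical(unname(f(names(expected))), unname(expected))
}

# A function that leaves the strings unchanged does not satisfy the rules.
check_rules(identity)
```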

I tried using this function:

library(stringr)

f = function(x) {
  ifelse(grepl(".{4} .{4}", x) & !grepl(",|[A-z]{2}/. ", x), str_replace_all(x, "\\s+", "_"), x)
}

f(chars)
#> [1] "check24_smavey_dr_klein"   "smava"                     "check24, interhyp"         "verivox_check24_dr._klein"
#> [5] "dr. klein"                 NA                          "dr. weber"                 "dr. klein,"               
#> [9] NA                          "check24_verivox"        

The problem is that the check is applied to the whole string at once, so I can't handle each pair of neighbouring words separately (e.g. elements [1] and [4]).

Any idea how to do this?

MSR
Banjo

2 Answers


Is this what you're after?

chars <- c("check24  smavey  dr klein", "smava", "check24, interhyp", 
           "verivox  check24  dr. klein", "dr. klein", NA, "dr. weber", 
           "dr. klein,", NA, "check24  verivox")

library(stringr)

str_replace_all(chars, "([\\w]{4,})(?<=[^,.])[\\s]+([\\w]{4,})", "\\1_\\2")
#>  [1] "check24_smavey  dr klein"   "smava"                     
#>  [3] "check24, interhyp"          "verivox_check24  dr. klein"
#>  [5] "dr. klein"                  NA                          
#>  [7] "dr. weber"                  "dr. klein,"                
#>  [9] NA                           "check24_verivox"

Created on 2019-12-21 by the reprex package (v0.2.1)

This uses capturing groups for words of four or more characters (([\\w]{4,})) and a lookbehind ((?<=[^,.])) to exclude commas and full stops before the whitespace.

MSR

You could match 1+ horizontal whitespace chars and assert what is on the left and right are 4 word characters and use an underscore in the replacement.

Instead of \w you could also use [A-Za-z]. Note that [A-z] matches more than letters: in ASCII it also covers the six characters between Z and a ([, \, ], ^, _ and `).
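The difference between [A-z] and [A-Za-z] can be checked directly. The following is a small sketch using base R with perl = TRUE, so that the ranges are compared by code point; the vector probe is just an illustrative set of single characters.

```r
# A few single characters: two letters, three ASCII symbols that sit
# between "Z" and "a", and a digit.
probe <- c("a", "Z", "_", "^", "[", "1")

# TRUE where the whole string is one character from the given class.
wide   <- grepl("^[A-z]$",    probe, perl = TRUE)
strict <- grepl("^[A-Za-z]$", probe, perl = TRUE)

# Characters accepted by [A-z] but rejected by [A-Za-z].
probe[wide & !strict]
```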

(?<=\w{4})\h+(?=\w{4})
  • (?<=\w{4}) Positive lookbehind, assert what is on the left are 4 word chars
  • \h+ Match 1+ horizontal whitespace char
  • (?=\w{4}) Positive lookahead, assert what is on the right are 4 word chars
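Because the lookbehind and lookahead are zero-width, the words themselves are not consumed, so one word can sit between two replaced gaps. As a sketch of why that matters, the comparison below runs a pattern that captures (and therefore consumes) both words against the lookaround pattern, using two of the question's examples. The variable names are illustrative only.

```r
x <- c("Dr. Name1 Name2 Name3", "AAA AAAA AAAA AAAA")

# Capturing both words consumes them, so the word on the right of one match
# cannot also serve as the word on the left of the next match.
consuming <- gsub("(\\w{4,})\\s+(\\w{4,})", "\\1_\\2", x, perl = TRUE)

# Lookarounds only assert the neighbours; only the whitespace is matched.
zero_width <- gsub("(?<=\\w{4})\\h+(?=\\w{4})", "_", x, perl = TRUE)

consuming
#> [1] "Dr. Name1_Name2 Name3" "AAA AAAA_AAAA AAAA"
zero_width
#> [1] "Dr. Name1_Name2_Name3" "AAA AAAA_AAAA_AAAA"
```

On runs of three or more qualifying words, only the lookaround version replaces every gap.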


For example

chars <- c("check24  smavey  dr klein", "smava", "check24, interhyp", 
  "verivox  check24  dr. klein", "dr. klein", NA, "dr. weber", 
  "dr. klein,", NA, "check24  verivox", "Name1, Name2 Name3", "Dr. Name1 Name2 Name3", "AAA AAAA AAAA AAAA")

gsub('(?<=\\w{4})\\h+(?=\\w{4})', '_', chars, perl=TRUE)

Output

 [1] "check24_smavey  dr klein"   "smava"                     
 [3] "check24, interhyp"          "verivox_check24  dr. klein"
 [5] "dr. klein"                  NA                          
 [7] "dr. weber"                  "dr. klein,"                
 [9] NA                           "check24_verivox"           
[11] "Name1, Name2_Name3"         "Dr. Name1_Name2_Name3"     
[13] "AAA AAAA_AAAA_AAAA"   
The fourth bird
  • @Banjo I see that you have switched the accepted answer. That answer can be shortened to `(\w{4,})\s+(\w{4,})` https://regex101.com/r/5XqgSb/1 You don't need square brackets around `\w` and `\s` and you don't need the positive lookbehind `(?<=[^,.])` because that is always true as `\w` does not match a comma or dot. – The fourth bird Jan 18 '20 at 16:24
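The equivalence claimed in the comment above can be checked on the question's vector. This is only a sketch: it assumes base R's gsub with perl = TRUE rather than stringr, and it tests only these ten inputs, not the patterns in general.

```r
chars <- c("check24  smavey  dr klein", "smava", "check24, interhyp",
           "verivox  check24  dr. klein", "dr. klein", NA, "dr. weber",
           "dr. klein,", NA, "check24  verivox")

# Pattern from the first answer, with the bracketed classes and the lookbehind.
original <- gsub("([\\w]{4,})(?<=[^,.])[\\s]+([\\w]{4,})", "\\1_\\2",
                 chars, perl = TRUE)

# Shortened pattern suggested in the comment.
shortened <- gsub("(\\w{4,})\\s+(\\w{4,})", "\\1_\\2", chars, perl = TRUE)

# TRUE if both patterns produce the same result for every element.
identical(original, shortened)
```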