How to trim a long string after several colons to n characters?

Question

I would like to do this with gsub and/or dplyr in R:

Ok here's the example text:

example_string <-
  "Bing Bloop Doop:-14490 Flerp:01 ScoobyDoot:Z1Bling Blong:Zootsuitssasdfasdf"

What I'd like to get:

"Bing Bloop Doop: Flerp: ScoobyDoot:Z1Bling Blong:Zootsuit"

I'd like to strip all the digits (and any hyphens) except for the Z# part, and then limit the text after the last colon to 9 characters. There are always 4 colons.

I'm going through all different kinds of threads and they get close sometimes but no cigar.

I was able to remove all digits:

gsub('[0-9]+', '', example_string)

But this doesn't trim nchar to 9 at the end and also takes out the Z1 part:

"Bing Bloop Doop: Flerp: ScoobyDoot:ZBling Blong:Zootsuitssasdfasdf"

Remove n number of characters in a string after a specific character

Regex allow a string to only contain numbers 0 - 9 and limit length to 45

bdbmax · Accepted Answer · 2022-10-17T18:09:48.433

1

Here is an option using a combination of base R functions (strsplit, lapply and gsub). It splits the string on the colon, then iterates over each piece: the piece is re-split on whitespace, any token made up only of digits (optionally preceded by a hyphen) is blanked, and the tokens are re-collapsed with a space. Finally, the pieces are re-collapsed with the colon.

# Split the string by the colon
colon_split <- unlist(strsplit(example_string, ":"))

# Over all strings split by the colon
digits_out <- lapply(colon_split, \(x) {
  space <- unlist(strsplit(x, "\\s"))
  gsub("^-(\\d*)$|^(\\d*)$", "", space) |> paste0(collapse = " ")
})

# Regroup and collapse using the colon  
paste0(digits_out, collapse = ":")
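
The code above leaves the text after the last colon at full length. A minimal sketch of one way to add the trimming step to the same split-and-rejoin approach, assuming the cut-off is 8 characters (to match the wanted "Zootsuit") and using `vapply` and `substr`, which are not part of the original answer:

```r
example_string <-
  "Bing Bloop Doop:-14490 Flerp:01 ScoobyDoot:Z1Bling Blong:Zootsuitssasdfasdf"

# Split on the colon
pieces <- strsplit(example_string, ":", fixed = TRUE)[[1]]

# Blank every token that is only digits, optionally preceded by a hyphen
pieces <- vapply(pieces, function(p) {
  tokens <- strsplit(p, " ", fixed = TRUE)[[1]]
  paste(sub("^-?\\d+$", "", tokens), collapse = " ")
}, character(1), USE.NAMES = FALSE)

# Keep at most 8 characters of the last piece (assumed cut-off)
last <- length(pieces)
pieces[last] <- substr(pieces[last], 1, 8)

result <- paste(pieces, collapse = ":")
result
```

Because the digits are only removed from tokens that are entirely numeric, "Z1Bling" is left alone without needing a lookbehind.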

edited Oct 17 '22 at 18:09

answered Oct 17 '22 at 18:00

bdbmax


That works, thank you! Can you help me understand the regex in gsub? I don't understand the $, |, and ^ parts – SqueakyBeak Oct 17 '22 at 18:05
1

Sure! The `^` indicates the start of the string. `\\d` is any single digit, and `*` means zero or more of what precedes it. `|` is just 'or'. `$` is the end of the string. In words, `gsub("^-(\\d*)$|^(\\d*)$", "", space)` means: replace with nothing a string that starts with `-` and is then followed only by digits, until the end of the string. OR, replace with nothing a string that consists only of digits from start to end. I suggest you read up on regular expressions. Here is a great website for testing them: https://regex101.com/ – bdbmax Oct 17 '22 at 18:12
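
To see what the `^` and `$` anchors change, here is a small sketch that runs the same pattern against a few made-up tokens (the `tokens` vector is invented for illustration) and compares it with an unanchored pattern:

```r
tokens <- c("-14490", "01", "Z1Bling", "Flerp", "a-1")

# Anchored: only tokens that are digits from start to end
# (optionally with a leading hyphen) are matched
matches_whole <- grepl("^-(\\d*)$|^(\\d*)$", tokens)

# Anchored replacement leaves mixed tokens such as "Z1Bling" untouched
anchored <- gsub("^-(\\d*)$|^(\\d*)$", "", tokens)

# Unanchored replacement strips digits wherever they occur
unanchored <- gsub("-?\\d+", "", tokens)
```

Here `anchored` keeps "Z1Bling" and "a-1" intact, while `unanchored` turns them into "ZBling" and "a".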

score 0 · Answer 2 · answered Oct 18 '22 at 05:50

TLDR solution:

If you expect your conditions to remain the same, this solution calls gsub only once on the whole string, so the removal step doesn't need to iterate over the colon-separated pieces.

example_string <- 
    "Bing Bloop Doop:-14490 Flerp:01 ScoobyDoot:Z1Bling Blong:Zootsuitssasdfasdf"

# remove digits not preceded by "Z", and all hyphens
re_pattern <- "(?<!Z)[0-9]|-"
res <- gsub(re_pattern, "", example_string, perl = TRUE) # the lookbehind needs perl = TRUE

# split up by colon
colon_splits <- unlist(strsplit(res, ":"))

# keep the first 8 characters of the last (5th) piece
# (8 rather than 9, since the wanted result was "Zootsuit"; R indexes from 1)
last_colon_str <- substr(colon_splits[5], 1, 8)

str_list <- c(colon_splits[1:4], last_colon_str) # combine using c()

# regroup and collapse (thanks bdbmax)
paste0(str_list, collapse = ":")
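
Since the question also mentions dplyr, a vectorised variant may be handy. The following is only a sketch: `clean_labels` and its `n` parameter are hypothetical names, and the second pattern (trim everything after the last colon to `n` characters) is an alternative to the split-and-rejoin step, not part of the answer above.

```r
# Hypothetical helper: works on a whole character vector at once
clean_labels <- function(x, n = 8) {
  # Step 1: drop digits not preceded by "Z", and all hyphens
  x <- gsub("(?<!Z)[0-9]|-", "", x, perl = TRUE)
  # Step 2: keep the last colon plus up to n following characters,
  # and drop whatever else follows up to the end of the string
  pattern <- sprintf("(:[^:]{0,%d})[^:]*$", n)
  sub(pattern, "\\1", x, perl = TRUE)
}

example_string <-
  "Bing Bloop Doop:-14490 Flerp:01 ScoobyDoot:Z1Bling Blong:Zootsuitssasdfasdf"

cleaned <- clean_labels(example_string)
cleaned
```

Because the function takes and returns a character vector, it should drop into a data-frame workflow, e.g. `dplyr::mutate(df, label = clean_labels(label))`, where `df` and `label` are placeholder names.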

Regex Explanation

Each | is an OR separator: the pattern matches if either alternative matches. You can use it whenever there are multiple patterns to match.

That being said, there are 2 alternatives in the pattern here:

1. Z# and numbers `"(?<!Z)[0-9]"`

We can further break down (?<!Z)[0-9] into:

"(?<!Z)" # a negative lookbehind: the previous character must not be 'Z'
"[0-9]" # a character class that matches one digit from 0 to 9

Together they match any digit 0 to 9 ([0-9]) that is not immediately preceded by a 'Z' ((?<!Z)). Lookbehinds are not supported by R's default regex engine, which is why gsub is called with perl = TRUE.

For more info you can take a look at this explanation of lookarounds
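
A quick sketch of the lookbehind on a few made-up inputs (the vector `x` is invented for illustration). Note what happens with a two-digit code such as "Z12": only the digit directly after the Z is protected.

```r
x <- c("Z1Bling", "A1Bling", "-14490", "Z12")

# Remove every digit that is not immediately preceded by "Z"
no_digits <- gsub("(?<!Z)[0-9]", "", x, perl = TRUE)
no_digits
```

Here "Z1Bling" is unchanged, "A1Bling" loses its digit, "-14490" is reduced to the bare hyphen (which the second alternative of the full pattern then removes), and "Z12" becomes "Z1" because the 2 is preceded by 1, not by Z.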

2. hyphen `-`

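This is the hyphen that you want to remove. If you would like to remove more characters, you can put them in a character class, for example the hyphen and the dollar sign:

"[-$]"

Bear in mind that a - placed between two other characters in a class is interpreted as a range. To match it literally, put it first or last in the class, or escape it (this needs perl = TRUE, and the backslash is doubled inside an R string):

"[1\\-2]" # matches "1", "2", and "-"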

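The difference between a literal hyphen and a range inside a character class can be checked with a short sketch. The input string is made up, and in R code the escaped form is written with a doubled backslash and `perl = TRUE`:

```r
x <- "a-b$c1-2"

# Hyphen first in the class: literal "-" or "$"
first_pos <- gsub("[-$]", "", x)

# Escaped hyphen: matches "1", "-" and "2"
escaped <- gsub("[1\\-2]", "", x, perl = TRUE)

# Unescaped hyphen between two characters: the range 1 to 2, hyphens survive
as_range <- gsub("[1-2]", "", x)
```

The three results differ: `first_pos` is "abc12", `escaped` is "ab$c", and `as_range` is "a-b$c-".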