Regex for replacing non-numeric characters INSIDE parentheses within a string in a dplyr workflow

Question

My question is related to an already answered question, "Need to extract individual characters from a string column using R".

I tried to solve it with what I know, and I need to find out how to remove non-numeric characters inside parentheses within a string.

This is the dataframe with column x:

  team     linescore     ondate                                     x
1  NYM     010000000 2020-08-01             0, 1, 0, 0, 0, 0, 0, 0, 0
2  NYM (10)1140006x) 2020-08-02 (, 1, 0, ), 1, 1, 4, 0, 0, 0, 6, x, )
3  BOS     002200010 2020-08-13             0, 0, 2, 2, 0, 0, 0, 1, 0
4  NYM  00000(11)01x 2020-08-15    0, 0, 0, 0, 0, (, 1, 1, ), 0, 1, x
5  BOS        311200 2020-08-20                      3, 1, 1, 2, 0, 0

structure(list(team = c("NYM", "NYM", "BOS", "NYM", "BOS"), linescore = c("010000000", 
"(10)1140006x)", "002200010", "00000(11)01x", "311200"), ondate = structure(c(18475, 
18476, 18487, 18489, 18494), class = "Date"), x = list(c("0", 
"1", "0", "0", "0", "0", "0", "0", "0"), c("(", "1", "0", ")", 
"1", "1", "4", "0", "0", "0", "6", "x", ")"), c("0", "0", "2", 
"2", "0", "0", "0", "1", "0"), c("0", "0", "0", "0", "0", "(", 
"1", "1", ")", "0", "1", "x"), c("3", "1", "1", "2", "0", "0"
))), class = "data.frame", row.names = c(NA, -5L))

Desired Output:

  team     linescore     ondate                             x
1  NYM     010000000 2020-08-01     0, 1, 0, 0, 0, 0, 0, 0, 0
2  NYM (10)1140006x) 2020-08-02 10, 1, 1, 4, 0, 0, 0, 6, x, )
3  BOS     002200010 2020-08-13     0, 0, 2, 2, 0, 0, 0, 1, 0
4  NYM  00000(11)01x 2020-08-15    0, 0, 0, 0, 0, 11, 0, 1, x
5  BOS        311200 2020-08-20              3, 1, 1, 2, 0, 0

How can I change (, 1, 0, ) to 10 and (, 1, 1, ) to 11 and leave the rest as is?

Some help I already got so far:

regex for replacement of specific character outside parenthesis only (thanks to AnilGoyal)
gsub("\\D+", "", str1) (thanks to akrun)
gsub("[(,) ]", "", "(, 1, 0, )") (thanks to Anoushiravan)
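For reference, a quick check of what the two `gsub` snippets return. This is a minimal sketch: the variable names `inner` and `whole` are made up for illustration, and `whole` is row 4 of column x pasted into one string.

```r
# Hypothetical inputs: one parenthesised group, and a whole row of x as text
inner <- "(, 1, 0, )"
whole <- "0, 0, 0, 0, 0, (, 1, 1, ), 0, 1, x"

# Both snippets collapse the isolated group to "10"
only_digits <- gsub("\\D+", "", inner)     # drop every run of non-digits
no_punct    <- gsub("[(,) ]", "", inner)   # drop parentheses, commas, spaces
only_digits
# [1] "10"
no_punct
# [1] "10"

# Applied to the whole row, the first snippet also removes the separators
# outside the parentheses (and the "x"), which is why it is not enough alone
collapsed <- gsub("\\D+", "", whole)
collapsed
# [1] "000001101"
```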

Thanks!

If it gets removed again, please flag it to moderators. (I added it once again.) I consider this a regex question – akrun, Jul 31 '21 at 05:41
This is indeed a regex question, and a very good one. It also happened to me once that a regex question of mine got continuous downvotes and later went up to a score of +9 – AnilGoyal, Aug 01 '21 at 08:15
@AnilGoyal It is not about that. If that were the case, the regex tag would also be removed [here](https://stackoverflow.com/questions/68612942/in-r-and-regex-how-to-detect-a-character-with-excluding-some-mixed-condition) – akrun, Aug 01 '21 at 18:44
@AnilGoyal as is [here](https://stackoverflow.com/questions/68612942/in-r-and-regex-how-to-detect-a-character-with-excluding-some-mixed-condition). Both were posted within the last day. So you can just guess the obvious reason :=) – akrun, Aug 01 '21 at 19:06

akrun · Accepted Answer · 2021-07-30T19:17:50.350

7

We could do this in base R. One option is to insert a delimiter after each character that is outside the `(...)` using `(*SKIP)(*FAIL)`, then remove the paired `()` while keeping the characters inside by capturing them as a group, and finally return the list by splitting at the `,` with `strsplit`.

df1$x <-  strsplit(gsub("\\((\\d+)\\)", "\\1,",
    gsub("\\([^)]+\\)(*SKIP)(*FAIL)|(.)", "\\1,", 
      df1$linescore, perl = TRUE)),",")

-output

df1$x
[[1]]
[1] "0" "1" "0" "0" "0" "0" "0" "0" "0"

[[2]]
 [1] "10" "1"  "1"  "4"  "0"  "0"  "0"  "6"  "x"  ")" 

[[3]]
[1] "0" "0" "2" "2" "0" "0" "0" "1" "0"

[[4]]
[1] "0"  "0"  "0"  "0"  "0"  "11" "0"  "1"  "x" 

[[5]]
[1] "3" "1" "1" "2" "0" "0"
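To make the nested call easier to follow, here is the same pipeline unrolled into its passes on two line scores from the question. This is only a sketch: the intermediate names `step1`, `step2` and `result` are introduced for illustration.

```r
# Two line scores taken from the question's data
linescore <- c("(10)1140006x)", "00000(11)01x")

# Pass 1: append "," to every character that is outside a "(...)" group;
# the groups themselves are skipped and left untouched
step1 <- gsub("\\([^)]+\\)(*SKIP)(*FAIL)|(.)", "\\1,", linescore, perl = TRUE)
step1
# [1] "(10)1,1,4,0,0,0,6,x,),"  "0,0,0,0,0,(11)0,1,x,"

# Pass 2: replace "(digits)" by "digits,", i.e. drop the paired parentheses
step2 <- gsub("\\((\\d+)\\)", "\\1,", step1)
step2
# [1] "10,1,1,4,0,0,0,6,x,),"  "0,0,0,0,0,11,0,1,x,"

# Pass 3: split at ","; strsplit() drops the empty piece after the last ","
result <- strsplit(step2, ",")
```

The unmatched `)` at the end of the first line score survives because it is never part of a complete `(...)` group, so pass 1 treats it like any other single character.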

edited Jul 30 '21 at 19:17

answered Jul 30 '21 at 19:07


@AnoushiravanR We match all the characters inside the `()`, i.e. `\\(` matches a `(`, followed by one or more characters that are not a `)` (`[^)]+`) and the closing parenthesis (`\\)`). Then we say to SKIP those cases, i.e. the characters within the `()` including the parentheses themselves. Then we capture any other character with `(.)` and, in the replacement, add the backreference (`\\1`) followed by a delimiter (`,`) so that it can be used in `strsplit` – akrun Jul 31 '21 at 20:06
Arun, so what is `(*FAIL)` doing there? You have explained SKIP. – AnilGoyal Aug 01 '21 at 07:54
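One way to see what each verb contributes is to run three variants of the pattern on a short string. This is a sketch on a made-up input, `"00(11)0"`, and not part of the original answer. `(*FAIL)` forces the first alternative to fail after it has matched the group, so the group is never replaced. `(*SKIP)` additionally stops the engine from retrying inside the text it has just matched.

```r
s <- "00(11)0"   # made-up example input

# No verbs: "(11)" is matched by the first alternative, capture group 1
# stays empty, so the whole group is replaced by a bare ","
plain <- gsub("\\([^)]+\\)|(.)", "\\1,", s, perl = TRUE)
plain
# [1] "0,0,,0,"

# (*FAIL) only: the group match is rejected, the engine backtracks and
# falls through to (.), so "(", "1", "1", ")" are delimited one by one
fail_only <- gsub("\\([^)]+\\)(*FAIL)|(.)", "\\1,", s, perl = TRUE)
fail_only
# [1] "0,0,(,1,1,),0,"

# (*SKIP)(*FAIL): the group match is rejected and the next attempt starts
# after the ")", so the group passes through unchanged
skip_fail <- gsub("\\([^)]+\\)(*SKIP)(*FAIL)|(.)", "\\1,", s, perl = TRUE)
skip_fail
# [1] "0,0,(11)0,"
```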

score 3 · Answer 2 · answered Jul 31 '21 at 20:00

Here is another way to get to your desired output that I just figured out. It relies on regex only to split at the parentheses. However, the use of regex makes your solution much more elegant and compact:

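library(dplyr)  # select(), as_tibble()
library(purrr)

# Split each line score at the parentheses, then break every piece longer than
# two characters into single characters (shorter pieces are kept whole, which
# assumes they are the parenthesised scores such as "10" or "11")
map(df %>% select(linescore), ~ strsplit(.x, "\\(|\\)")) %>%
      flatten() %>%
      map_dfr(~ map(.x, ~ if(nchar(.x) > 2) strsplit(.x, "")[[1]] else .x) %>%
                reduce(~ c(.x, .y)) %>%
                keep(~ nchar(.x) != 0) %>% t() %>%
                as_tibble() %>% 
                set_names(~ paste0("inng", 1:length(.x))))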

# A tibble: 5 x 9
  inng1 inng2 inng3 inng4 inng5 inng6 inng7 inng8 inng9
  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 0     1     0     0     0     0     0     0     0    
2 10    1     1     4     0     0     0     6     x    
3 0     0     2     2     0     0     0     1     0    
4 0     0     0     0     0     11    0     1     x    
5 3     1     1     2     0     0     NA    NA    NA

Mine just omitted the `)` characters that were in your desired output. I'm trying to figure out a way to keep them. – Anoushiravan R, Jul 31 '21 at 20:27
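One possible way to keep a stray `)` is to tokenise each line score instead of splitting it. This is a base R sketch of a different technique and not the approach of the answer above; the names `tokens` and `x` are chosen for illustration. The pattern tries `(digits)` first and otherwise takes a single character, so an unmatched `)` becomes its own token.

```r
# Three line scores taken from the question's data
linescore <- c("010000000", "(10)1140006x)", "00000(11)01x")

# Extract tokens: either a complete "(digits)" group or any single character
tokens <- regmatches(linescore,
                     gregexpr("\\(\\d+\\)|.", linescore, perl = TRUE))

# Strip the parentheses only from tokens that are a complete "(digits)" group
x <- lapply(tokens, function(tok) sub("^\\((\\d+)\\)$", "\\1", tok))

x[[2]]
# [1] "10" "1"  "1"  "4"  "0"  "0"  "0"  "6"  "x"  ")"
```

Because the result is a plain list with one character vector per row, it can be assigned to a list column in the same way as the `strsplit` result in the accepted answer.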
