Need to extract individual characters from a string column using R

Question

Background

Below is my gamedata dataset in dput form -- it contains line scores for some MLB games.

structure(list(team = c("NYM", "NYM", "BOS", "NYM", "BOS"), linescore = c("010000000", 
"(10)1140006x", "002200010", "00000(11)01x", "311200"), ondate = structure(c(18475, 
18476, 18487, 18489, 18494), class = "Date")), class = "data.frame", row.names = c(NA, 
-5L))

For example, here is one line score: "002200010".

Some line scores end with an 'x' and some have a two-digit value in parentheses, as in "00000(11)01x". Each digit not in parentheses indicates how many runs the team scored in the inning. If a team scores more than nine runs in an inning, the number is placed in parentheses, so in the line score "00000(11)01x" the team scored 11 runs in the sixth inning and did not come to bat in the bottom of the ninth (shown by the 'x').

Not every line score has nine innings. Some have more and some have as few as six.

What I need to do

First, what I need to do is to get how many runs a team scored in each inning, e.g., the first, second, third, and so on, and put each runs scored in a new column. I would prefer for the solution to use dplyr.

I have reviewed stackoverflow's suggested solutions, but found none that matched what I need. If there is one, I would appreciate it if you would share its URL.

I have tried to do it using this code:

gamedata %>%
  select(ondate, team, linescore) %>%
  mutate(inng1 = str_extract(linescore, "\\d|\\(\\d{2}\\)"))

Here is the output:

ondate      team linescore    inng1
2020-08-01  NYM 010000000       0   
2020-08-02  NYM (10)1140006x  (10)  
2020-08-13  BOS 002200010       0   
2020-08-15  NYM 00000(11)01x    0   
2020-08-20  BOS 311200          3

Second, how can I remove the parentheses in the inng1 column for '10'?

The code below produced the error beneath it:

gamedata %>%
  select(ondate, team, linescore) %>%
  mutate(inng1 = str_extract(linescore, "\\d|\\(\\d{2}\\)"))
 str_remove_all(inng1,"[()]")

This is the error message I got:

"Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement), : object 'inng1' not found"

Third, I need to know how to extract the runs scored for each additional inning, starting with the second, putting each value in its own column, e.g, inng2, inng3, and so one.

At the end, I should have the output shown above (without parentheses for each double-digit inning) with a column for each inning, so there would be a column titled "inng1," "inng2," inng3", "inng4", and so on. The data in the inning columns needs to be numeric as later I will be summing them.

The inng1 column contains the first value in the 'linescore' column for each observation. The inng2 column contains the second value in the 'linescore' column for each observation, and so on. Numbers in parentheses count as one value. — Metsfan, Jul 30 '21 at 15:59
can you tell me what does two closing brackets in any string mean? Is extra closing bracket is always followed by a `x`? — AnilGoyal, Jul 30 '21 at 16:26
AnilGoyal, if what you are referring to is the last parenthesis in the line score in the second observation, that is a mistake. There should never be a parenthesis after an 'x'. — Metsfan, Jul 30 '21 at 16:40
yes, I was referring to that only. See my answer. Please check. If that is working, I'll add suitable explanation. — AnilGoyal, Jul 30 '21 at 16:43
I think that have to be removed because OP has stated that he want numeric output — AnilGoyal, Jul 30 '21 at 17:06
You should have added expected output in respect of sample data to avoid any confusion. — AnilGoyal, Jul 31 '21 at 12:38

Anoushiravan R · Answer 1 · 2021-07-31T00:57:27.963

5

Solution02

Here is another solution you can use for this question which is much more efficient than the first one and is predominantly based on purrr family of functions:

library(dplyr)
library(purrr)

df %>%
  bind_cols(
    map(df %>% select(linescore), ~ strsplit(.x, "\\(|\\)")) %>%
      flatten() %>%
      map_dfr(~ map(.x, ~ if(nchar(.x) > 2) strsplit(.x, "")[[1]] else .x) %>%
                reduce(~ c(.x, .y)) %>%
                keep(~ nchar(.x) != 0) %>% t() %>%
                as_tibble() %>% 
                set_names(~ paste0("inng", 1:length(.x)))) %>%
      mutate(across(everything(), ~ replace(.x, .x == "x", NA_character_)), 
             count_inng = pmap_dbl(cur_data(), ~ sum(!is.na(c(...)))), 
             sums_inng = pmap_dbl(select(cur_data(), starts_with("inng")), 
                                  ~ sum(as.numeric(c(...)), na.rm = TRUE)))
  )

  team    linescore     ondate inng1 inng2 inng3 inng4 inng5 inng6 inng7 inng8 inng9 count_inng
1  NYM    010000000 2020-08-01     0     1     0     0     0     0     0     0     0          9
2  NYM (10)1140006x 2020-08-02    10     1     1     4     0     0     0     6  <NA>          8
3  BOS    002200010 2020-08-13     0     0     2     2     0     0     0     1     0          9
4  NYM 00000(11)01x 2020-08-15     0     0     0     0     0    11     0     1  <NA>          8
5  BOS       311200 2020-08-20     3     1     1     2     0     0  <NA>  <NA>  <NA>          6
  sums_inng
1         1
2        22
3         5
4        12
5         7

Solution01

I have made some modifications to my solution as it incorrectly replaced the double-digit number in the output vector and I think it has been fixed. I think this solution may help you. For this purpose I decided to write a custom function to detect two-digits numbers and also trim the output of your scores:

library(dplyr)
library(stringr)
library(tidyr)
library(purrr)

fn <- function(x) {
  out <- c()
  if(str_detect(x, "\\((\\d){2}\\)")) {
    double <- str_replace_all(str_extract(x, "\\((\\d){2}\\)"), "[)()]", "")
    ind <- str_locate(x, "\\(")
    x <- str_remove(x, "\\((\\d){2}\\)")
    out <- c(out, str_split(x, "")[[1]])
    out[(ind[1, 1]+1):(length(out)+1)] <- out[(ind[1, 1]):length(out)]
    out[ind] <- double
  } else {
    out <- c(out, str_split(x, "")[[1]])
  }
  if(any(grepl(")", out))) {
    out <- out[-which(out == ")")]
  }
  out
}

# Test
fn("(10)1140006x)")
[1] "10" "1"  "1"  "4"  "0"  "0"  "0"  "6"  "x"

Then we apply it on our data set in a row-wise operation:

df %>%
  mutate(linescore = map(linescore, fn)) %>% 
  unnest_wider(linescore) %>%
  rename_with(~ gsub("(\\.\\.\\.)(\\d)", paste0("inng", "\\2"), .), starts_with("...")) %>%
  mutate(across(starts_with("inng"), ~ {replace(.x, .x == "x", NA)
    as.numeric(.x)}), 
    inns_count = pmap_dbl(select(cur_data(), starts_with("inng")), 
                          ~ sum(!is.na(c(...)))), 
    inns_sums = pmap_dbl(select(cur_data(), starts_with("inng")), 
                         ~ sum(c(...), na.rm = TRUE)))

# A tibble: 5 x 13
  team  inng1 inng2 inng3 inng4 inng5 inng6 inng7 inng8 inng9 ondate     inns_count inns_sums
  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <date>          <dbl>     <dbl>
1 NYM       0     1     0     0     0     0     0     0     0 2020-08-01          9         1
2 NYM      10     1     1     4     0     0     0     6    NA 2020-08-02          8        22
3 BOS       0     0     2     2     0     0     0     1     0 2020-08-13          9         5
4 NYM       0     0     0     0     0    11     0     1    NA 2020-08-15          8        12
5 BOS       3     1     1     2     0     0    NA    NA    NA 2020-08-20          6         7

edited Jul 31 '21 at 00:57

answered Jul 30 '21 at 16:46

Anoushiravan R

21,622
3
18
41

Try converting charcters to numeric too. That's what is expected too – AnilGoyal Jul 30 '21 at 16:48
1

I need to make a slight modification to it first. Yes I will do it then. – Anoushiravan R Jul 30 '21 at 16:49
2

Bravo guys. I am trying further, how could I regex `(, 1, 0, )` to `10` – TarJae Jul 30 '21 at 16:59
1

@TarJae try reverse. Refer [this](https://stackoverflow.com/q/67886333/2884859) question – AnilGoyal Jul 30 '21 at 17:05
@AnilGoyal I fixed the problem. Just don't know what to do with `x` values? should they be there or what? – Anoushiravan R Jul 30 '21 at 17:06
1

If x be left anywhere, the values cannot be numerical. So have to be removed IMHO – AnilGoyal Jul 30 '21 at 17:11
1

Anoushiravan, the ‘x’ values are only used in baseball to indicate that the home team did not bat in the bottom half of the last inning. An ‘x’, when it appears, will only be at the end of a line score. – Metsfan Jul 30 '21 at 17:12
OK thanks, I replaced them with `NA` values and calculate the count and sum of each row. You may check my edits. – Anoushiravan R Jul 30 '21 at 17:16
1

@TarJae Thank you brother, we are waiting for you :) – Anoushiravan R Jul 30 '21 at 17:17
Anoushiravan, if I wanted to sum the values in one or more of the innings columns, how would I do it given that your code creates the column names? – Metsfan Jul 30 '21 at 18:26
@Metsfan I have done it in my solution with those 2 `pmap_dbl` function right at the end that will calculate the count and sum of the named vectors. – Anoushiravan R Jul 30 '21 at 18:41
But if it is only one or any number of column specified the name(s) in the second part of this `select(cur_data(), starts_with("inng"))` in place of `starts_with`. – Anoushiravan R Jul 30 '21 at 18:46
Anoushiravan, you are correct. In my last comment, I was asking about how to sum the rows. Is there a way of doing it besides exporting the output and then reading in the data? – Metsfan Jul 30 '21 at 19:45
I'm sorry I am afraid I did not understand. Would you please explain how you would like to sum the rows? – Anoushiravan R Jul 30 '21 at 19:55
Nothing needs to be done. I figured it out. Thanks. – Metsfan Jul 30 '21 at 23:38
@Metsfan I have provided another solution, more straight-forward and efficient. – Anoushiravan R Jul 31 '21 at 00:58
In this code -- (str_detect(x, "\\((\\d){2}\\)") -- why are two backslashes before both the first left parenthesis and the last right parenthesis rather than one? – Metsfan Jul 31 '21 at 13:07
This `(\\d){2}` matches a two-digit number and I wanted to check whether for example a `(10)` exists or not and since parenthesis are special metachacharacters we have to escape them with double back slash like `\\(` to emphasis we mean a literal `(` otherwise their behavior changes as they are used to specify capturing group and in that case we only capture the numbers inside without `()`. – Anoushiravan R Jul 31 '21 at 14:30

AnilGoyal · Accepted Answer · 2021-07-31T12:53:59.207

Something like this will do-

using base R's gsub for some regex transformation
using stringr::str_trim and stringr::str_count() (that is optional though)
using tidyr::separate
along with dplyr::mutate.

Steps-

remove x from string linescore (I mutate this into a new column, you may mutate existing column also)
replace every character outside parenthesis with that character plus a space using gsub again with the help of regex
thereafter remove parenthesis strings
use tidyr::separate to separate string into a different column each.
use convert = TRUE to convert each string into numeric.

For regex transformation explanation check this

library(tidyverse)
df <- structure(list(team = c("NYM", "NYM", "BOS", "NYM", "BOS"), linescore = c("010000000", 
                                                                                "(10)1140006x", "002200010", "00000(11)01x", "311200"), ondate = structure(c(18475, 
                                                                                                                                                             18476, 18487, 18489, 18494), class = "Date")), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                                                                                -5L))

df %>%
  mutate(inn = gsub('x', '', linescore),
         inn = str_trim(gsub("(.)(?![^(]*\\))", "\\1 ", inn, perl=TRUE)),
         inn = gsub('\\(|\\)', '', inn),
         innings_count = 1 + str_count(inn, ' ')) %>%
  separate(inn, into = paste0('innings_', seq(max(.$innings_count))), sep = ' ', fill = 'right', convert = TRUE)
#>   team    linescore     ondate innings_1 innings_2 innings_3 innings_4
#> 1  NYM    010000000 2020-08-01         0         1         0         0
#> 2  NYM (10)1140006x 2020-08-02        10         1         1         4
#> 3  BOS    002200010 2020-08-13         0         0         2         2
#> 4  NYM 00000(11)01x 2020-08-15         0         0         0         0
#> 5  BOS       311200 2020-08-20         3         1         1         2
#>   innings_5 innings_6 innings_7 innings_8 innings_9 innings_count
#> 1         0         0         0         0         0             9
#> 2         0         0         0         6        NA             8
#> 3         0         0         0         1         0             9
#> 4         0        11         0         1        NA             8
#> 5         0         0        NA        NA        NA             6

When I run this code, I get the whole linescore from the second game in the "innings_1" column with the rest of the row filled out with NAs. — jdonland, Jul 30 '21 at 16:49
@thefringthing, I changed the sample data as per OP's comment above. Please check revised sample data — AnilGoyal, Jul 30 '21 at 16:56
Bravo guys. I am trying further, how could I regex `(, 1, 0, )` to `10` — TarJae, Jul 30 '21 at 16:59
AnilGoyal, When I tested your code, the output contains 'x' values; whereas, in your type conversion code, which generated errors when I ran it, the x's are all replaced with NA, which I prefer. — Metsfan, Jul 30 '21 at 19:47
@Metsfan, that means you want to remove `x` before hand. I assumed that may create an impact on number of innings that's why removed the `x` in last. Please check the revised code. I have added explanation too — AnilGoyal, Jul 31 '21 at 12:37
@AnilGoyal (1) You wrote: "using stringr::str_trim and stringr::str_count() (that is optional though)." If str_trim is optional, without it how do you add a space to separate the runs in each inning? (2) You wrote: "remove x from string linescore," but when I ran your revised code 'x' appears in the linescore column but not in any innings column, which is what I want. — Metsfan, Jul 31 '21 at 14:07
@Metsfan, see I have created a new column `inn` and left linescore as such? What do you want actually, I am unable to understand! — AnilGoyal, Jul 31 '21 at 14:25
AnilGoyal, What you had done is fine. Nothing more needs to be done. — Metsfan, Jul 31 '21 at 20:23

Need to extract individual characters from a string column using R

2 Answers2

Linked