I've been working on an R function that filters a large data frame of baseball team batting stats by game id (e.g. "2016/10/11/chnmlb-sfnmlb-1") to create a list of past team matchups by season.

For some combinations of teams the output is correct, but for others it is not (the output contains a variety of ids).

I'm not very familiar with grep and assume that is where the problem lies. I patched my grep line and list output together by searching Stack Overflow, and thought I had it until testing proved otherwise.

matchup.func <- function (home, away, df) {

    matchups <- grep(paste('[0-9]{4}/[0-9]{2}/[0-9]{2}/[', home, '|', away, 'mlb]{6}-[', away, '|', home, 'mlb]{6}-[0-9]{1}', sep = ''), df$game.id, value = TRUE)

    df <- df[df$game.id %in% matchups, c(1, 3:ncol(df))]

    out <- list()
    for (n in 1:length(unique(df$season))) {
        for (s in unique(df$season)[n]) {
            out[[s]] <- subset(df, season == s)
        }
    }
    return(out)
}

Sample of the data frame:

bat.stats[sample(nrow(bat.stats), 3), ]
       date                        game.id team wins losses flag ab r  h d t hr rbi bb po da so lob   avg   obp   slg   ops   roi season
1192 2016-04-11 2016/04/11/texmlb-seamlb-1  sea    2      5 away 38 7 14 3 0 0    7  2 27  8 11  15 0.226 0.303 0.336 0.639 0.286      R
764  2016-03-26 2016/03/26/wasmlb-slnmlb-1  sln    8     12 away 38 7  9 2 1 1    5  2 27  8 11  19 0.289 0.354 0.474 0.828 0.400      S
5705 2016-09-26 2016/09/26/oakmlb-anamlb-1  oak   67     89 home 29 2  6 1 0 1    2  2 27 13  4  12 0.260 0.322 0.404 0.726 0.429      R

Sample of errant output:

matchup.func('tex', 'sea', bat.stats)
$S
          date team wins losses flag ab  r  h d t hr rbi bb po da so lob   avg   obp   slg   ops   roi season
21  2016-03-02  atl    1      0 home 32  4  7 0 0  2   3  2 27 19  2  11 0.203 0.222 0.406 0.628 1.000      S
22  2016-03-02  bal    0      1 away 40 11 14 3 2  2  11 10 27 13  4  28 0.316 0.415 0.532 0.947 0.000      S
47  2016-03-03  bal    0      2 home 41 10 17 7 0  2  10  0 27  9  3  13 0.329 0.354 0.519 0.873 0.000      S
48  2016-03-03  tba    1      1 away 33  3  5 0 1  0   3  2 24 10  8  13 0.186 0.213 0.343 0.556 0.500      S
141 2016-03-05  tba    2      2 home 35  6  6 2 0  0   5  3 27 11  5  15 0.199 0.266 0.318 0.584 0.500      S
142 2016-03-05  bal    0      5 away 41 10 17 5 1  0  10  4 27  9 10  13 0.331 0.371 0.497 0.868 0.000      S

Sample of good output:

matchup.func('bos', 'bal', bat.stats)
$S
          date team wins losses flag ab  r  h d t hr rbi bb po da so lob   avg   obp   slg   ops   roi season
143 2016-03-06  bal    0      6 home 34  8 14 4 0  0   8  5 27  5  8  22 0.284 0.330 0.420 0.750 0.000      S
144 2016-03-06  bos    3      2 away 38  7 10 3 0  0   7  7 24  7 13  25 0.209 0.285 0.322 0.607 0.600      S
209 2016-03-08  bos    4      3 home 37  1 12 1 1  0   1  4 27 15  8  26 0.222 0.292 0.320 0.612 0.571      S
210 2016-03-08  bal    0      8 away 36  5 12 5 0  1   4  4 27  9  4  27 0.283 0.345 0.429 0.774 0.000      S

In the good case it gives a list of matchups by season as it should (i.e. S, R, F, D). In the bad case it also splits the output by season, but seems to select games by date only and not by team. Not sure what to think.

Cœur
  • Can you add samples? And the ones on which it worked and ones in which it didn't? – ar7 Oct 12 '16 at 15:31
  • Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). This will make it much easier for others to help you. – Jaap Oct 12 '16 at 15:43
  • Yes, sorry about that, new to this. I hope my edit helps. – user3138409 Oct 12 '16 at 16:10

1 Answer

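I think the issue is that a regex inside [] behaves differently than you might expect. Square brackets define a character class: it matches any single one of the listed characters, in any order, and a | inside the brackets is just a literal character rather than "or". So [tex|seamlb]{6} matches any six characters drawn from that set, not "tex" or "sea" followed by "mlb". Instead, you might try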

matchups <- grep(paste0("(", home, "|", away, ")mlb-(", home, "|", away, ")mlb")
                 , df$game.id, value = TRUE)

That should match either the home or the away team, followed by either the home or the away team. Without more sample data, though, I am not sure whether this will catch every edge case.
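To see the difference concretely, here is a small check you can run. The first game id comes from the sample data in the question; the other two are ids I made up from the team codes in the errant output, so treat them as assumptions. It rebuilds the original pattern for 'tex' and 'sea' and compares it with the grouped version.

```r
home <- "tex"
away <- "sea"

# Pattern as built by the original function: teams sit inside [] classes
bad <- paste0("[0-9]{4}/[0-9]{2}/[0-9]{2}/[", home, "|", away, "mlb]{6}-[",
              away, "|", home, "mlb]{6}-[0-9]{1}")

# Grouped alternation: whole team codes, in either order
good <- paste0("(", home, "|", away, ")mlb-(", home, "|", away, ")mlb")

ids <- c("2016/04/11/texmlb-seamlb-1",   # from the question's sample
         "2016/03/02/balmlb-atlmlb-1",   # made-up id, unrelated teams
         "2016/03/03/tbamlb-balmlb-1")   # made-up id, unrelated teams

bad_hits  <- grepl(bad, ids)    # which ids the bracket pattern accepts
good_hits <- grepl(good, ids)   # which ids the grouped pattern accepts
print(data.frame(ids, bad_hits, good_hits))
```

Every letter of "balmlb", "atlmlb" and "tbamlb" happens to be in the set t, e, x, s, a, m, l, b, so the bracket pattern accepts all three ids. With 'bos' and 'bal' the set is smaller, which would explain why that pair happened to look correct.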

Note also that you don't have to match the entire string, so the date-matching regex at the beginning is likely superfluous.
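Putting this together, a minimal sketch of the whole function might look like the following. The name matchup_sketch and the toy data frame are mine, not from the question, and I am assuming the column to drop is game.id. I also swapped the nested loops for split(), which I believe builds the same per-season list.

```r
matchup_sketch <- function(home, away, df) {
  # Grouped alternation matches whole team codes; no date prefix needed
  pattern <- paste0("(", home, "|", away, ")mlb-(", home, "|", away, ")mlb")
  keep <- grepl(pattern, df$game.id)
  # Drop the game.id column by name rather than by position
  df <- df[keep, names(df) != "game.id", drop = FALSE]
  # One data frame per season that is present in the filtered rows
  split(df, df$season, drop = TRUE)
}

# Toy data (made up for illustration, two rows per game as in bat.stats)
toy <- data.frame(
  game.id = c("2016/04/11/texmlb-seamlb-1", "2016/04/11/texmlb-seamlb-1",
              "2016/03/05/seamlb-texmlb-1", "2016/03/05/seamlb-texmlb-1",
              "2016/03/02/balmlb-atlmlb-1", "2016/03/02/balmlb-atlmlb-1"),
  team    = c("tex", "sea", "sea", "tex", "bal", "atl"),
  season  = c("R", "R", "S", "S", "S", "S"),
  stringsAsFactors = FALSE
)

res <- matchup_sketch("tex", "sea", toy)
print(res)
```

Only the tex/sea rows should survive, grouped under their season codes. One thing to keep in mind: the pattern would also accept an id like "texmlb-texmlb", which presumably never occurs in real data.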

Mark Peterson