
I'm trying to write a regular expression to get data about TV show episodes from a file name.

I'll start with showing a few examples of the kind of inputs I'm dealing with and how I'd like the data output.

Input:

  • showname.s01e01e02e03.extension
  • showname.s01e01-02-03.extension

Note: The number of episodes can vary, but there will be at least 2.

Output:

  • Season number, e.g. '01' from s01
  • Episode numbers, e.g. 01e02e03 or 01-02-03

Note: If it's possible to get individual episode numbers using regex, that'd be great as well - if not, I'll just split them.

What I've tried:

I'm not really that great with regular expressions, so my current attempt is probably horrible. Anyway, here's what I've got so far - obviously not working:

(?i)s(?<season>\\d{1,4})(e(\\d{1,3})){2,}

My idea was to create a group for the season number (that part works), and then try to match the episode numbers based on the repetition, but yeah, it's 3.41 AM here and I can't really wrap my head around it.

A complete solution would be nice, but any ideas or pointers are very much appreciated :-)

PS. I'll add a bounty if the accepted answer contains an explanation of the regex in order to help both myself and others learn.

Michell Bak

3 Answers


Season: 01 - Episodes: 01-02-03

The simple code at the bottom outputs the string above (as seen at the bottom of the Java demo). But you said you'd like some explanations, so we'll proceed step by step.

Step-By-Step

Let's first build a simple regex. Then we'll refine the output for your needs.

Search: ^.*?s(\d{2})((?:e\d{2})+)\..*

Replace: Season: $1 - Episodes: $2

Output: Season: 01 - Episodes: e01e02e03

In the regex101 demo, see the substitutions at the bottom. In the Java code below, we won't replace anything. This is just to see how things work.

Explaining the Match

  • ^ asserts that we are at the beginning of the string
  • .*? lazily matches characters, up to...
  • s(\d{2}) matches s, then the parentheses capture two digits to Group 1
  • The outer parentheses in ((?:e\d{2})+) define capture Group 2
  • The non-capturing group (?:e\d{2}) matches e and two digits, and
  • The + quantifier ensures we do that once or more, allowing us to capture all the episodes into Group 2
  • \. matches the period before the extension
  • .* matches the rest of the string, up to its end

Explaining the Replacement

  • In the code below, we won't have the e between the episodes.
  • Season: writes the literal characters Season:
  • $1 is a back-reference to Group 1, and inserts the season
  • - Episodes: inserts the literal characters - Episodes:
  • $2 is a back-reference to Group 2, and inserts the episodes
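If you'd like to try the search and replace itself from Java rather than in a text editor, here is a minimal sketch. The class name SeasonReplaceSketch and the method describe are made up for this illustration; the pattern and the replacement string are the ones explained above, with the backslashes doubled because they sit inside a Java string literal.

```java
import java.util.regex.Pattern;

class SeasonReplaceSketch {
    // Search pattern from above, written as a Java string literal (backslashes doubled).
    static final Pattern SEARCH = Pattern.compile("^.*?s(\\d{2})((?:e\\d{2})+)\\..*");

    // Hypothetical helper: applies the replacement "Season: $1 - Episodes: $2".
    // If the pattern does not match, the input comes back unchanged.
    static String describe(String fileName) {
        return SEARCH.matcher(fileName).replaceAll("Season: $1 - Episodes: $2");
    }

    public static void main(String[] args) {
        // Group 2 still contains the e characters at this stage.
        System.out.println(describe("showname.s01e01e02e03.extension"));
        // prints: Season: 01 - Episodes: e01e02e03
    }
}
```

Note that a file name the pattern does not match is returned as-is, so a real program would probably want to check for a match first.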

Going Further: Dashes between Episode Numbers (or other refinements)

Let's say you want Season: 01 - Episodes: 01-02-03

This is not possible in a simple regex search and replace in a text editor, but it is easy in a programming language that allows you to use the capture groups of your match to build an output string.

Here is sample Java code (see the output at the bottom of the online demo):

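import java.util.regex.Matcher;
import java.util.regex.Pattern;

String subject = "showname.s01e01e02e03.extension";
// Group 1 = season digits, Group 2 = the whole run of eNN episodes
Pattern regex = Pattern.compile("^.*?s(\\d{2})((?:e\\d{2})+).*");
Matcher m = regex.matcher(subject);
String myoutput = "No Match"; // initialize
if (m.find()) {
    myoutput = "Season: " + m.group(1) +" - Episodes: " ;
    // drop the leading e, then turn the remaining e characters into dashes
    myoutput += m.group(2).substring(1,m.group(2).length()).replace("e", "-");
}
System.out.println(myoutput); // Season: 01 - Episodes: 01-02-03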

How the Code Works

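  • We use our regex from above (the Java version just leaves out the \. before the final .*, which makes no difference for these inputs)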
  • For our match, we build an output string in several steps
  • As in the simple demo, myoutput = "Season: " + m.group(1) +" - Episodes: " ; gives us some literal characters, Group 1 (the season), and more literal characters
  • For the episodes string, instead of using Group 2 directly (i.e. m.group(2)), we first cut off its leading e, as we don't want that one to become a dash: m.group(2).substring(1,m.group(2).length()). Then we replace the remaining e characters with dashes: replace("e", "-")
zx81
  • FYI added code to show how to get the output like this, without the `e` between the episodes: `Season: 01 - Episodes: 01-02-03` Let me know if you have any questions. :) – zx81 Jun 22 '14 at 02:33
  • Thanks a lot for (both) your answer(s)! While this does answer the question, Pshemo's answer adds some additional stuff that I needed (and was visible from my own attempt). Sorry about not being clear in the requirements - it was late :-/ – Michell Bak Jun 22 '14 at 11:58
  • `explanation of the regex in order to help both myself and others learn.` So... no bounties? :) – zx81 Jun 24 '14 at 02:48
  • Well, you really shouldn't ask for that kind of stuff. Anyway, the text you quote refers to the accepted answer. I've already commented on that, saying that I'd put up a bounty, to which Pshemo said it wasn't needed as the question wasn't that challenging. – Michell Bak Jun 24 '14 at 10:56
  • Thanks for explaining, but I still think it was a valid question. You did mention a bounty in the question statement, which motivated me to spend a lot of time on a detailed answer. Regardless of which answer you accepted, it's only natural I would be curious as to what happened to that initial incentive you had dangled in front of answerers. As to what I should or shouldn't do... Sorry, but that's not for you to say... That's my wife's business!... lol – zx81 Jun 24 '14 at 11:08
  • Sorry, I really don't mind putting up bounties or anything like that - in this specific case, the guy behind the accepted answer simply said that it wasn't needed. On a side note, I think people should always attempt to be as detailed as possible in their answers. Any given regex string might solve a problem, but it adds another problem if there's no explanation of the regex :-) – Michell Bak Jun 24 '14 at 11:13
  • Yes, I agree with that. See you another time. :) – zx81 Jun 24 '14 at 11:24

(I live in the same timezone as you, so my attempt may not be accurate either since I am half asleep, but here I go)

If I understood you correctly (I was also trying to analyse your regex attempt):

  • the part sXXXXeXXXeXXX or sXXXXeXXX-XXX is always placed between dots,
  • sXXXX can exist only once, but can have 1-4 digits (represented here by X),
  • there must be an eXXX part after the season information, followed by at least one more element of the form eXXX or -XXX (each containing 1-3 digits).

In that case you can use a regex like this (written as a Java string literal, hence the doubled backslashes)

[.]s(?<season>\\d{1,4})e(?<episodes>\\d{1,3}([e-]\\d{1,3})+)[.]

which means

  • [.] dot literal

  • s(?<season>\\d{1,4}) will match sXXXX and store its digits (without the s) in a group named season

  • e literal placed after season info (seems mandatory from your examples)

  • (?<episodes>\\d{1,3}([e-]\\d{1,3})+) in this case

    • \\d{1,3} will match XXX
    • ([e-]\\d{1,3})+ and at least one of eXXX or -XXX after it.

    In other words, it will match XXXeXXX, XXX-XXX or even something like XXXeXXX-XXX and place it in the group named episodes

  • [.] dot literal placed after the information we are searching for
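The reason for capturing all episodes as one string, instead of repeating a capturing group once per episode, is that in Java a repeated capturing group only keeps what its last repetition matched. Here is a small sketch of that behaviour using the pattern from the question; the class name RepeatedGroupSketch and the method lastEpisodeCaptured are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupSketch {
    // Pattern from the question: group 1 is the named season group,
    // group 2 is the repeated (e...) group, group 3 the digits inside it.
    static final Pattern ATTEMPT =
            Pattern.compile("(?i)s(?<season>\\d{1,4})(e(\\d{1,3})){2,}");

    // Hypothetical helper: returns what group 3 holds after a match,
    // or null if the pattern does not match at all.
    static String lastEpisodeCaptured(String fileName) {
        Matcher m = ATTEMPT.matcher(fileName);
        return m.find() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        // Three repetitions matched, but only the last one is kept in the group.
        System.out.println(lastEpisodeCaptured("showname.s01e01e02e03.extension")); // 03
        // The dash-separated form is not matched, since every repetition requires an e.
        System.out.println(lastEpisodeCaptured("showname.s01e01-02-03.extension")); // null
    }
}
```

So the repetition itself matches fine, but the earlier episode numbers are no longer reachable through the group afterwards.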

If you want a structure holding a separate list of episodes, you just need to split the match from the group named episodes. Since this match can have the form XXXeXXX-XXX, you can split on e or -, which can be represented by the regex [e-] or e|-.

Demo:

import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String[] data = {
        "showname.s01e01e02e03.extension",
        "showname.s01e01-02-03.extension",
};
// CASE_INSENSITIVE so that S01E01E02 is accepted as well
Pattern p = Pattern.compile(
                "[.]s(?<season>\\d{1,4})e(?<episodes>\\d{1,3}([e-]\\d{1,3})+)[.]",
                Pattern.CASE_INSENSITIVE);
for (String input : data){
    Matcher m = p.matcher(input);
    while (m.find()){
        String season = m.group("season");
        System.out.println(season);
        String episodes = m.group("episodes");
        System.out.println(episodes);
        // [eE-] rather than [e-]: the pattern ignores case, so the separator may be an upper-case E
        String[] singleEpisodes = episodes.split("[eE-]");

        System.out.println("episode numbers"+Arrays.toString(singleEpisodes));
    }
    System.out.println("-----");
}

Output:

01
01e02e03
episode numbers[01, 02, 03]
-----
01
01-02-03
episode numbers[01, 02, 03]
-----
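If you would rather get the individual episode numbers with regex instead of split, one option is to run a second, very small pattern over the episodes group and collect every run of digits. This is only a sketch of that idea: the class name EpisodeListSketch and the method episodeNumbers are made up for this illustration, and the file pattern is the same one as in the demo above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EpisodeListSketch {
    // Same pattern as in the demo above.
    static final Pattern FILE = Pattern.compile(
            "[.]s(?<season>\\d{1,4})e(?<episodes>\\d{1,3}([e-]\\d{1,3})+)[.]",
            Pattern.CASE_INSENSITIVE);
    // Each episode number is a run of 1-3 digits inside the episodes group.
    static final Pattern NUMBER = Pattern.compile("\\d{1,3}");

    // Hypothetical helper: returns the episode numbers in order,
    // or an empty list if the file name does not match.
    static List<String> episodeNumbers(String fileName) {
        List<String> result = new ArrayList<>();
        Matcher file = FILE.matcher(fileName);
        if (file.find()) {
            Matcher number = NUMBER.matcher(file.group("episodes"));
            while (number.find()) {
                result.add(number.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(episodeNumbers("showname.s01e01e02e03.extension")); // [01, 02, 03]
        System.out.println(episodeNumbers("showname.s01e01-02-03.extension")); // [01, 02, 03]
    }
}
```

Because the inner loop only looks at digits, it does not care whether the separator was e, E or -.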
Pshemo
  • Thanks a lot! Yeah, the stuff you "extracted" from my attempt is spot on. Should've probably added that to the question. Anyway, this is the answer that's closest to what I'm trying to achieve, and it'll be easy to modify it a bit :-) Thanks! – Michell Bak Jun 22 '14 at 11:56
  • Will add bounty when possible. – Michell Bak Jun 22 '14 at 11:56
  • @MichellBak Glad it works for you. Anyway, you don't need to add a bounty since this question wasn't that challenging (at least not as much as the [previous one](http://stackoverflow.com/questions/17098834/split-string-with-dot-while-handling-abbreviations) I answered) and probably will not be reused by many other people. Also, you were pretty close, so I didn't change your regex that much. – Pshemo Jun 22 '14 at 14:08
  • Oh yeah, you answered my last one as well! Regex god! :-D – Michell Bak Jun 22 '14 at 14:15
  • [Nah](http://i2.kym-cdn.com/photos/images/original/000/345/169/bc7.png), I am just on good terms with regex but have a lot to learn (and practice) to become a god like [this guy](http://stackoverflow.com/a/3644267/1393766), for instance. – Pshemo Jun 22 '14 at 14:44

You may as well take advantage of the work others have already done on regex matching of episode names. For example, see this page, which discusses some advanced topics regarding XBMC and how it matches episode names:

http://wiki.xbmc.org/index.php?title=Advancedsettings.xml#tvshowmatching

In case that link becomes stale in the future, some of the things mentioned are:

<tvshowmatching>
  <regexp>[Ss]([0-9]+)[][ ._-]*[Ee]([0-9]+)([^\\/]*)$</regexp>  <!-- foo.s01.e01, foo.s01_e01, S01E02 foo, S01 - E02 -->
  <regexp>[\._ -]()[Ee][Pp]_?([0-9]+)([^\\/]*)$</regexp>  <!-- foo.ep01, foo.EP_01 -->
  <regexp>([0-9]{4})[\.-]([0-9]{2})[\.-]([0-9]{2})</regexp>  <!-- foo.yyyy.mm.dd.* (byDate=true) -->
  <regexp>([0-9]{2})[\.-]([0-9]{2})[\.-]([0-9]{4})</regexp>  <!-- foo.mm.dd.yyyy.* (byDate=true) -->
  <regexp>[\\/\._ \[\(-]([0-9]+)x([0-9]+)([^\\/]*)$</regexp>  <!-- foo.1x09* or just /1x09* -->
  <regexp>[\\/\._ -]([0-9]+)([0-9][0-9])([\._ -][^\\/]*)$</regexp>  <!-- foo.103*, 103 foo -->
  <regexp>[\/._ -]p(?:ar)?t[_. -]()([ivx]+)([._ -][^\/]*)$</regexp>  <!-- Part I, Pt.VI -->
</tvshowmatching>

Note that XBMC is just a starting point. I'd look up all those similar types of software packages to see what regex they eventually decided to use, as they've already put a lot of thought into it.

Stéphane
  • Thanks. My app is actually a competitor to XBMC, so I'd rather not use their code base :-) Besides, that part doesn't deal with multi-episode files. – Michell Bak Jun 22 '14 at 11:50