
I want to extract season and episode from a filename in C. For example, if the input string is "Game.of.Thrones.S05E02.720p.HDTV.x264-IMMERSE.mkv", then I want to extract the substring "S05E02" out of it.

At the moment, I'm using a very naive approach that matches characters one at a time. Concretely, I find an 'S', then check that the next two characters are both digits between '0' and '9', that the following character is 'E', and that the two characters after the 'E' are also digits.

#include <stdio.h>
#include <string.h>

// Return the index of the match if the pattern is found, -1 otherwise
int get_tvshow_details(const char filename[])
{
  int filename_len = (int)strlen(filename);
  for (int i = 0; i < filename_len-5; ++i) {
    char season_prefix = filename[i];
    char episode_prefix = filename[i+3];
    char season_left_digit = filename[i+1];
    char season_right_digit = filename[i+2];
    char episode_left_digit = filename[i+4];
    char episode_right_digit = filename[i+5];

    if ((season_prefix == 'S' || season_prefix == 's') 
        && (episode_prefix == 'E' || episode_prefix == 'e')
        && (season_left_digit >= '0' && season_left_digit <= '9') 
        && (season_right_digit >= '0' && season_right_digit <= '9')
        && (episode_left_digit >= '0' && episode_left_digit <= '9') 
        && (episode_right_digit >= '0' && episode_right_digit <= '9')) {

      printf("match found at %d\n", i);
      return i;
    }
  }
  return -1;
}

Is there a more efficient way in C to extract the pattern S<2_digit_number>E<2_digit_number> from any TV show filename?

Wiktor Stribiżew
Ronak Sharma
  • You'll need to show the code for the "very naive approach" you're using or we'll be unable to help make improvements to it. – tadman Jul 03 '20 at 20:53
  • @tadman Sorry, I should have mentioned it in my post. I've added it now. Can you please check it? – Ronak Sharma Jul 03 '20 at 21:11
  • This is tagged regular expression so I was expecting to see one of those. This could be condensed a lot using things like [`isdigit()`](https://en.cppreference.com/w/c/string/byte/isdigit) or a regular expression library. – tadman Jul 03 '20 at 21:19
  • If you're doing it all in basic C I'd recommend implementing a very simple state machine rather than brute-force testing like this. These are pretty easy to write and can be made very flexible. – tadman Jul 03 '20 at 21:20
  • Why are you worried about "most efficient"? Have you benchmarked your process and found that parsing filenames is a bottleneck in your processing? – Andrew Henle Jul 03 '20 at 21:20
  • C really isn't a very convenient language for this task... have you considered using something else? – Nate Eldredge Jul 03 '20 at 21:22
  • _Side note:_ I do this all the time [and _much_ more]. But I use a scripting language that supports regexes (e.g. `perl`). If you really want to do it in C, a regex library (e.g. `pcre`) might be able to help. – Craig Estey Jul 03 '20 at 21:23
  • @CraigEstey I have already written a much more robust version of this in python. However, I wanted to rewrite the code in C for benchmarking purposes. – Ronak Sharma Jul 03 '20 at 21:27
  • Unless you're processing tens of millions of files this will be adequately fast in Python. – tadman Jul 03 '20 at 21:28
  • @NateEldredge I have already written this in python. I just wanted a C version for benchmarking purposes. – Ronak Sharma Jul 03 '20 at 21:29
  • @tadman I agree. I just wanted to see how many times faster C can be for this job. But yeah, my python version does this task for 100 files in around 200-300 milliseconds. Not worth writing in C. But I'm writing it just as a fun project and checking benchmarks against thousands of files. – Ronak Sharma Jul 03 '20 at 21:33
  • @tadman Thanks for your answer. Do you think regular expressions would be faster than regular string operations for a trivial task like this? – Ronak Sharma Jul 03 '20 at 21:35
  • The problem is that the bottleneck is unlikely to be pattern matching, but pulling the directory contents itself. Python's regular expressions are going to be nearly as fast as those in C. While regular expressions are not necessarily "faster", they are a lot easier to write. `/S\d\dE\d\d/` is a handful of characters and does what your ~20 lines of code does. – tadman Jul 03 '20 at 21:40
  • _If_ all filenames are of the same format, you might even consider `sscanf`. But, the trick is to scan the string left-to-right _once_. Can you rely on the fact that the major sections are separated by `.`? If so, `strtok` might be an option. Or, just repeated `cp = strchr(cp,'.')` – Craig Estey Jul 03 '20 at 21:40
  • What else is guaranteed? (e.g.) Do you always have (e.g.) `720p` (or `1080p`). I'd refrain from the hardcoded offsets you have [from the back of the string]--they're not too robust if you hit a filename that isn't quite conforming (e.g.): `Game.of.Thrones.S05E02.mpeg` – Craig Estey Jul 03 '20 at 21:45
  • @tadman I've decided to go with regular expressions since they're a lot cleaner to write and they won't cause a bottleneck to my code. I'm currently using `[S,s][0-9]{1,2}[E,e][0-9]{1,2}`. Thanks for your help! – Ronak Sharma Jul 03 '20 at 22:05
  • @CraigEstey Thanks for your help especially for pointing out the robustness factor. I've decided to use regex `[S,s][0-9]{1,2}[E,e][0-9]{1,2}`. Do you think this would suffice? – Ronak Sharma Jul 03 '20 at 22:08
  • Why not `\d` instead of the more verbose `[0-9]`? They're the same thing. Also `[S,s]` matches either `S`, `s` *or* `,` as well. Just do `[Ss]`, or even better, set the regular expression to be case-insensitive. – tadman Jul 03 '20 at 22:29
  • @tadman Silly me. I made it `[Ss]`. Even after using the `REG_ICASE` cflag in the `regcomp` function of `regex.h`, it's not working as it is supposed to. So I'll stick with `[Ss]`. Also, `\d` is not working in my `regexec`, so I'll have to stick with `[0-9]`. Thanks for your help! – Ronak Sharma Jul 03 '20 at 23:17
  • `([Ss](\d*))([Ee](\d*))` might be a little more helpful than the original regex. It includes the small fix suggested by @tadman for the digits, adds capture groups for the individual parts, and covers cases like S01E129. With the input _Game.of.Thrones.S05E02.720p.HDTV.x264-IMMERSE.mkv_: Full Match = S05E02, Group 1 = S05, Group 2 = 05, Group 3 = E02, Group 4 = 02. – Azelski Jul 04 '20 at 04:06
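The pattern discussed in the comments above can be tried with the POSIX `regcomp`/`regexec` functions from `regex.h`. The following is only a sketch under some assumptions: the helper names `find_episode_regex` and `span_to_int` are made up for illustration, the pattern is compiled with `REG_EXTENDED` because the unescaped `{1,2}` interval is extended-regex syntax, and digits are written as `[0-9]` because POSIX regular expressions do not define `\d`.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: convert the digits in one matched span to an int. */
static int span_to_int(const char *s, regmatch_t m)
{
    int value = 0;
    for (regoff_t i = m.rm_so; i < m.rm_eo; ++i)
        value = value * 10 + (s[i] - '0');
    return value;
}

/* Hypothetical helper: search for S<1-2 digits>E<1-2 digits> anywhere in
   filename. Returns 0 and fills season/episode on a match, -1 otherwise. */
int find_episode_regex(const char *filename, int *season, int *episode)
{
    regex_t re;
    regmatch_t m[3]; /* m[0] = whole match, m[1] = season, m[2] = episode */

    /* REG_EXTENDED is needed for the unescaped {1,2} and ( ) syntax. */
    if (regcomp(&re, "[Ss]([0-9]{1,2})[Ee]([0-9]{1,2})", REG_EXTENDED) != 0)
        return -1;

    int rc = regexec(&re, filename, 3, m, 0);
    regfree(&re);
    if (rc != 0)
        return -1;

    *season = span_to_int(filename, m[1]);
    *episode = span_to_int(filename, m[2]);
    return 0;
}
```

Calling `find_episode_regex("Game.of.Thrones.S05E02.720p.HDTV.x264-IMMERSE.mkv", &s, &e)` should return 0 with `s` set to 5 and `e` set to 2. Compiling the pattern on every call keeps the sketch short; when processing many files, compiling once and reusing the `regex_t` would avoid repeated work.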

1 Answer

I'd like to propose another solution, very similar to regex, but not dependent on a separate library for regex. C's format strings are quite powerful, though primitive. I think they could actually work in this case.

The format string we'll need is: %*[^.].%*[^.].%*[^.].%*1[Ss]%d%*1[Ee]%d.

Let's compare this to a string like Game.of.Thrones.S05E02.720p.HDTV.x264-IMMERSE.mkv

  • The first %*[^.]. will consume Game. but not capture it.

  • The second %*[^.]. will consume of. but not capture it.

  • The third %*[^.]. will consume Thrones. but not capture it.

  • Now the fun part: %*1[Ss]%d%*1[Ee]%d. is designed to match S05E02. and to extract the 05 and 02 into integer variables. Let's discuss this.

    • %*1[Ss] will consume only 1 letter that is either S or s but not capture it
    • %d will consume the digits afterwards (05 in this case) and store it into an integer
    • %*1[Ee] will consume only 1 letter that is either E or e but not capture it
    • Finally, %d. will consume the digits afterwards, store them in an integer, and consume the . right after.

Used in code, it looks like this:

// Just a dummy string literal for testing
char s[] = "Game.of.Thrones.S05E02.720p.HDTV.x264-IMMERSE.mkv";
// Variables to store the numbers in
int seas, ep;
printf("%d\n", sscanf(s, "%*[^.].%*[^.].%*[^.].%*1[Ss]%d%*1[Ee]%d.", &seas, &ep));

You may notice that we're also printing the return value of sscanf (you don't have to print it; you can just store it). This is very important. If sscanf returns 2 (the number of values it stored), you know that the match succeeded and the provided string has the expected form. Any other value indicates either an incomplete match (0 or 1) or a complete failure (a negative value, EOF, when the input ends before anything could be converted).
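The return-value check described above could be wrapped in a small function. This is a minimal sketch, not part of the original answer: the name `parse_three_word_title` is hypothetical, and the format string still assumes exactly three dot-separated fields before the S..E.. field.

```c
#include <stdio.h>

/* Hypothetical wrapper around the format string above.
   Returns 1 only when both the season and the episode were converted,
   and 0 for a partial match, no match, or EOF. */
int parse_three_word_title(const char *s, int *season, int *episode)
{
    int converted = sscanf(s, "%*[^.].%*[^.].%*[^.].%*1[Ss]%d%*1[Ee]%d.",
                           season, episode);
    return converted == 2;
}
```

With "Game.of.Thrones.S05E02.720p.HDTV.x264-IMMERSE.mkv" this returns 1 and stores 5 and 2. With "Silicon Valley.S05E02.720p.mkv" it returns 0, because the fourth dot-separated field there is "mkv". Note that %d also skips leading whitespace and accepts a sign, so this check is looser than a strict two-digit pattern.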

If you run this piece of code, you get:

2

This is correct. If you print seas and ep afterwards, you get:

5 2
Chase
  • This is exactly what I wanted. Thanks a lot for the solution and for explaining it very well! I have a quick question. Are the execution speeds of C's format strings comparable to that of using regex for the same job? – Ronak Sharma Jul 04 '20 at 05:23
  • @RonakSharma if anything, it would be faster. Speed will certainly not be an issue here. – Chase Jul 04 '20 at 05:34
  • Thanks a lot! I'll definitely read more about format strings. They seem pretty useful in C. – Ronak Sharma Jul 04 '20 at 06:42
  • I tried using your format specifier, but it only applies to strings like "Game.of.Thrones.S05E02.720p.mkv" where there are 3 words, each followed by a `.`. It won't apply to strings like "Silicon Valley.S05E02.720p.mkv" where there are 2 words and spaces instead of `.` between words. Can you please tell me what modifications I can make to the format specifier in your answer so that it applies to generic cases where an arbitrary string precedes `SE`? – Ronak Sharma Jul 17 '20 at 07:24
  • @RonakSharma If the same tokens are present in both the identifier that you use to stop the match (`.`) and the match itself, you'll require a non-greedy match. Unfortunately, C format strings do not have non-greedy quantifiers. Your best chance in that case would be to use actual regex. See [this answer](https://stackoverflow.com/questions/20239817/posix-regular-expression-non-greedy) – Chase Jul 17 '20 at 08:14
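For the arbitrary-prefix case raised in the last two comments, one regex-free possibility is to try a short sscanf format at every 'S' or 's' in the string instead of describing the whole prefix in one format string. This is a sketch under stated assumptions, not the answerer's method: the name `find_episode_anywhere` is hypothetical, and the field width in `%2d` limits each number to two characters.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: try the S<digits>E<digits> format at each 'S'/'s'.
   Returns the offset of the first match, or -1 if nothing matches. */
int find_episode_anywhere(const char *filename, int *season, int *episode)
{
    for (const char *p = strpbrk(filename, "Ss"); p != NULL;
         p = strpbrk(p + 1, "Ss")) {
        int s, e;
        /* %2d reads at most two characters, so "S05E02" yields 5 and 2.
           Caveat: %d also skips whitespace and accepts a sign. */
        if (sscanf(p, "%*1[Ss]%2d%*1[Ee]%2d", &s, &e) == 2) {
            *season = s;
            *episode = e;
            return (int)(p - filename);
        }
    }
    return -1;
}
```

For "Silicon Valley.S05E02.720p.mkv" the first candidate (the 'S' of "Silicon") fails at the first %2d, and the second candidate at offset 15 matches with season 5 and episode 2. Because the prefix is skipped with strpbrk rather than with a scanset, dots and spaces in the title no longer matter.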