
I am trying to parse an HLS m3u8 file, and I am stuck on matching the m3u8 links. If URI= exists (as in #EXT-X-I-FRAME-STREAM-INF), I want to grab the value in quotation marks; if it doesn't (as in #EXT-X-STREAM-INF), I want to grab the link from the next line.

Text:

#EXT-X-STREAM-INF:BANDWIDTH=263851,CODECS="mp4a.40.2, avc1.4d400d",RESOLUTION=416x234,AUDIO="bipbop_audio",SUBTITLES="subs"
gear1/prog_index.m3u8 <== new line link
#EXT-X-I-FRAME-STREAM-INF:URI="gear1/iframe_index.m3u8",CODECS="avc1.4d400d",BANDWIDTH=28451


Regex:

(?:#EXT-X-STREAM-INF:|#EXT-X-I-FRAME-STREAM-INF:)(?:BANDWIDTH=(?<BANDWIDTH>\d+),?|CODECS=(?<CODECS>"[^"]*"),?|RESOLUTION=(?<RESOLUTION>\d+x\d+),?|AUDIO=(?<AUDIO>"[^"]*"),?|SUBTITLES=(?<SUBTITLES>"[^"]*"),?|URI=(?<URI>"[^"]*"),?)*


Srdjan M.

1 Answer


A quick fix for your pattern will look like this:

  • Capture the #EXT-X-STREAM-INF part into Group 1
  • Add (?J) modifier to allow named capturing groups with identical names
  • Add a conditional construct that will capture the whole line after the current pattern if Group 1 matched.

The pattern will look like

(?J)(?:(#EXT-X-STREAM-INF)|#EXT-X-I-FRAME-STREAM-INF):(?:BANDWIDTH=(?<BANDWIDTH>\d+),?|CODECS=(?<CODECS>"[^"]*"),?|RESOLUTION=(?<RESOLUTION>\d+x\d+),?|AUDIO=(?<AUDIO>"[^"]*"),?|SUBTITLES=(?<SUBTITLES>"[^"]*"),?|URI=(?<URI>"[^"]*"),?)*(?(1)\R(?<URI>(?:(?!#EXT)\S)+))


So, basically, I added (?(1)\R(?<URI>(?:(?!#EXT)\S)+)) at the end and captured (#EXT-X-STREAM-INF) at the start.

The conditional construct matches like this:

  • (? - start of the conditional construct
    • (1) - if Group 1 matched
    • \R - a line break
    • (?<URI> - start of a named capturing group
      • (?:(?!#EXT)\S)+ - any non-whitespace char (\S) that does not start a #EXT char sequence, 1 or more occurrences (+) (the so-called "tempered greedy token")
    • ) - end of the named capturing group
  • ) - end of the conditional construct
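
The pattern above relies on PCRE features. As a rough sketch of the same idea in Python (an adaptation, not part of the original pattern): Python's re module supports the (?(1)...) conditional but rejects duplicate group names and \R, so here the two URI groups get the hypothetical names URI_ATTR and URI_LINE, and the line break is written as (?:\r\n|\n|\r). SAMPLE and extract_uris are illustrative names as well.

```python
import re

# Python adaptation of the PCRE pattern (assumption: distinct group names
# URI_ATTR / URI_LINE stand in for the duplicate-named URI groups).
PATTERN = re.compile(
    r'(?:(#EXT-X-STREAM-INF)|#EXT-X-I-FRAME-STREAM-INF):'
    r'(?:BANDWIDTH=(?P<BANDWIDTH>\d+),?'
    r'|CODECS=(?P<CODECS>"[^"]*"),?'
    r'|RESOLUTION=(?P<RESOLUTION>\d+x\d+),?'
    r'|AUDIO=(?P<AUDIO>"[^"]*"),?'
    r'|SUBTITLES=(?P<SUBTITLES>"[^"]*"),?'
    r'|URI=(?P<URI_ATTR>"[^"]*"),?)*'
    # Conditional: only if Group 1 (#EXT-X-STREAM-INF) matched,
    # require a line break and capture the link on the next line.
    r'(?(1)(?:\r\n|\n|\r)(?P<URI_LINE>(?:(?!#EXT)\S)+))'
)

# The two tags from the question, without the "<== new line link" note.
SAMPLE = (
    '#EXT-X-STREAM-INF:BANDWIDTH=263851,CODECS="mp4a.40.2, avc1.4d400d",'
    'RESOLUTION=416x234,AUDIO="bipbop_audio",SUBTITLES="subs"\n'
    'gear1/prog_index.m3u8\n'
    '#EXT-X-I-FRAME-STREAM-INF:URI="gear1/iframe_index.m3u8",'
    'CODECS="avc1.4d400d",BANDWIDTH=28451\n'
)


def extract_uris(playlist):
    """Return the playlist links found in both tag types, without quotes."""
    uris = []
    for m in PATTERN.finditer(playlist):
        # Prefer the next-line link; fall back to the quoted URI attribute.
        uri = m.group("URI_LINE") or m.group("URI_ATTR")
        if uri:
            uris.append(uri.strip('"'))
    return uris


print(extract_uris(SAMPLE))
# ['gear1/prog_index.m3u8', 'gear1/iframe_index.m3u8']
```

In an engine that supports (?J), both links land in the single URI group, so the fallback in extract_uris is not needed.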
Wiktor Stribiżew
  • The only thing I am concerned about is if #EXT-X-STREAM-INF and #EXT-X-I-FRAME-STREAM-INF are on the same line. #EXT-X-STREAM-INF:BANDWIDTH=263851...#EXT-X-I-FRAME-STREAM-INF:URI="gear1/... – Srdjan M. Nov 04 '17 at 19:14
  • Changed from '(?(1)\R(?<URI>.*))' to '(?(1)\R(?<URI>[^#EXT\s]+))' and it works like a charm. https://regex101.com/r/9OPleo/2 Again, thank you. – Srdjan M. Nov 04 '17 at 19:42
  • @S.Kablar Sorry, I was offline. You cannot use a negated character class to match any text other than a given sequence of chars: [^#EXT\s] excludes the single chars #, E, X and T, not the string #EXT. You may use a tempered greedy token, [`(?(1)\R(?<URI>(?:(?!#EXT)\S)+))`](https://regex101.com/r/9OPleo/3). – Wiktor Stribiżew Nov 04 '17 at 21:38
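
The difference between the two approaches in the comments can be checked with a small sketch. It uses Python's re module only for illustration, and the link gear1/EXTRA_index.m3u8 is a made-up example chosen because it contains the letters E, X and T.

```python
import re

# Negated character class: stops at ANY single '#', 'E', 'X', 'T' or whitespace.
char_class = re.compile(r'[^#EXT\s]+')
# Tempered greedy token: stops only where the sequence "#EXT" begins.
tempered = re.compile(r'(?:(?!#EXT)\S)+')

# Hypothetical link containing uppercase E, X and T.
link = "gear1/EXTRA_index.m3u8"
# Both tags on one line, as asked about in the comments.
same_line = "gear1/prog_index.m3u8#EXT-X-I-FRAME-STREAM-INF:BANDWIDTH=28451"

print(char_class.match(link).group())     # gear1/
print(tempered.match(link).group())       # gear1/EXTRA_index.m3u8
print(tempered.match(same_line).group())  # gear1/prog_index.m3u8
```

The character class happens to work on the sample from the question only because that link contains no uppercase E, X or T.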