8

I have a M3U playlist that looks something like this:

#EXTM3U
#EXTINF:-1 tvg-id="wsoc.us" tvg-name="ABC 9 (Something) (WSOC)" tvg-logo="" group-title="US Locals",ABC 9 (Something) WSOC (WSOC) 
http://some.url/1
#EXTINF:-1 tvg-id="wbtv.us" tvg-name="CBS 3 WBTV (WBTV)" tvg-logo="" group-title="US Locals",CBS 3 WBTV (WBTV)
http://some.url/2
#EXTINF:-1 tvg-id="wcnc.us" tvg-name="NBC (Hey) 36 WCNC (WCNC)" tvg-logo="" group-title="US Locals (Something here)",NBC 36 (Hey) WCNC (WCNC)
http://some.url/3
#EXTINF:-1 tvg-id="wjzy.us" tvg-name="FOX 46 WJZY (Shout Out) (WJZY)" tvg-logo="" group-title="US Locals",FOX 46 WJZY (Shout Out) (WJZY)
http://some.url/4

I'm looking to get the last entry in the tvg-name field, without the parentheses - for example, WSOC, WBTV, WCNC, etc.

This works:

grep -Po 'tvg-name=\".*?\"'  Playlist.m3u | awk -F'(' '{print $NF}' | cut -f1 -d")" | sort -u

But I know there has got to be a better way than using grep, awk, and cut. It's been driving me nuts.

Gilles Quénot
  • There are m3u parsers out there – Gilles Quénot Apr 20 '23 at 20:15
  • your code includes a `sort -u` on the end; can the input file contain duplicate `tvg-name` fields? if the answer is 'yes', please update the question a) to include an input with a duplicate `tvg-name`, b) to show the expected output and c) to explicitly mention that you want a unique list; also ... do you need the final output to be sorted? – markp-fuso Apr 20 '23 at 20:44
  • Is the trailing space in row two intentional? – Cyrus Apr 20 '23 at 20:46
  • Having exactly the same target string appear twice in each line (e.g. on line 2 `WSOC` appears in `tvg-name="ABC 9 (Something) (WSOC)"` and also in `group-title="US Locals",ABC 9 (Something) WSOC (WSOC)`) makes this a bad test case as a script that doesn't do what you want could produce the output you want. – Ed Morton Apr 21 '23 at 00:39

7 Answers

4

You can make both ( and ) field separators, so you don't need the last cut.

You don't need to escape double quotes inside single-quoted strings.

grep -Po 'tvg-name=".*?"'  Playlist.m3u | awk -F'[()]' '{print $(NF-1)}'

If you're using GNU awk you can also use a capture group to get the tvg-name=".*" part, so you don't need grep.

awk 'match($0, /tvg-name="[^"]*\(([^)]*)/, m) { print m[1] }' Playlist.m3u

See AWK: Access captured group from line pattern

Barmar
4

With GNU awk: use ( and ) as field separators and, in every row that contains tvg-name, print the second-to-last field ($(NF-1)).

awk -F '[()]' '/tvg-name/{print $(NF-1)}' Playlist.m3u

Output:

WSOC
WBTV
WCNC
WJZY
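
Note that $(NF-1) counts from the end of the line, so on the sample it picks up the call sign repeated in the title rather than the one inside tvg-name. A minimal sketch that isolates the quoted tvg-name value first and only then splits on the parentheses; the function name tvg_last_paren is made up for illustration, and only POSIX awk features are assumed:

```shell
# Hypothetical helper: print the last parenthesised entry of each tvg-name value.
# Usage: tvg_last_paren Playlist.m3u   (reads stdin when no file is given)
tvg_last_paren() {
  awk '
    /tvg-name="/ {
      name = $0
      sub(/.*tvg-name="/, "", name)   # drop everything up to the opening quote
      sub(/".*/, "", name)            # drop the closing quote and the rest of the line
      n = split(name, part, /[()]/)   # same separators as above, applied to the field only
      if (n >= 3) print part[n - 1]   # text just before the final ")"
    }
  ' "$@"
}
```

With this, a title that ends in some other parenthesised text no longer affects the result.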
Cyrus
  • FYI that'd behave the same with any awk, it's not GNU-specific. It's printing the duplicate string from the end of the line though, not the last string in the tvg-name field. idk if that matters to the OP or not. – Ed Morton Apr 21 '23 at 00:31
4

Using just a regex with GNU grep:

grep -oP 'tvg-name.*\(\K\w+(?=\))' /tmp/file.m3u

The regular expression matches as follows:

Node      Explanation
tvg-name  'tvg-name'
.*        any character except \n (0 or more times, matching as many as possible)
\(        (
\K        resets the start of the match (what is Kept), a shorter alternative to a look-behind assertion; see "look arounds" and "Support of K in regex"
\w+       word characters (a-z, A-Z, 0-9, _) (1 or more times, matching as many as possible)
(?=       look ahead to see if there is:
\)          )
)         end of look-ahead

Or using a proper m3u parser:

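This requires installing a CPAN module first (the script also uses File::Slurp, which is likewise a CPAN module and not part of core Perl):

cpan Parse::M3U::Extended

Then the script: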

#!/usr/bin/env perl

use strict; use warnings;

use Parse::M3U::Extended qw(m3u_parser);
use File::Slurp;
use feature 'say';
my $m3u = read_file('/tmp/file.m3u');
my @items = m3u_parser($m3u);

foreach my $item (@items) {
    if ($item->{type} eq "directive" and $item->{tag} eq "EXTINF") {
        $_ = $item->{value};
        s/.*\((\w+)\)/$1/;
        say;
    }
}

This has the advantage of being reliably reusable for other use cases, which is not the case with ad hoc awk, sed, etc.

Output:

WSOC 
WBTV
WCNC
WJZY
Gilles Quénot
3

With your shown samples and attempts, please try the following awk code. It should work in any POSIX awk. It uses the match function together with substr: the regex tvg-name=".*\([^)]* matches the value, and sub then strips everything up to the last ( so that only the required part is printed.

awk '
match($0,/ tvg-name=".*\([^)]*/){
  val=substr($0,RSTART,RLENGTH)
  sub(/.*\(/,"",val)
  print val
}
'  Input_file
RavinderSingh13
2

Using any sed in any shell on every Unix box:

$ sed -n 's/.*tvg-name="[^"]*(\([^)]*\).*/\1/p' file
WSOC
WBTV
WCNC
WJZY
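
The question's own pipeline ended with sort -u. If a de-duplicated, sorted list is what is wanted (the question does not say so explicitly), the same sed command can be piped the same way. A small sketch; the function name unique_callsigns is only illustrative:

```shell
# Hypothetical wrapper: the sed command from above, followed by sort -u.
# Usage: unique_callsigns Playlist.m3u   (reads stdin when no file is given)
unique_callsigns() {
  # -n plus the p flag prints only lines where the substitution succeeded.
  sed -n 's/.*tvg-name="[^"]*(\([^)]*\).*/\1/p' "$@" | sort -u
}
```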
Ed Morton
1

you can do it the really complex perl or sed ways, with capturing groups and look-aheads and boundary assertions and what not,

or you can do it the awk way :

mawk 'NF *= /tvg-name/' FS='.+[(]|[)][^(]*$' OFS=

gawk '$_ = $(NF -= NF^!/tvg-name/)' FS='[#-?]+' 
WSOC
WBTV
WCNC
WJZY

if you're ABSOLUTELY sure nothing is beyond that final ), then even cleaner :

mawk 'NF *= /tvg-name/' FS='.+[(]|.$' OFS=

if you modify the regex by just a tiny bit, you can even get the broadcast networks ...

awk '(NF*=/tvg-name/) && $-_=$--NF' FS='[#-?]+' 
WSOC
WBTV
WCNC
WJZY
  • ... simply by changing that (#) to a (+) ...
awk '(NF*=/tvg-name/) && $-_=$--NF' FS='[+-?]+'  # old solution

awk 'NF*=/tvg-name/' FS='.+,| .+$' OFS=          # more succinct
ABC 
CBS 
NBC 
FOX 

or change it slightly for easy capture of components without needing a dedicated M3U parser or using capture groups :

awk 'NF*=/tvg-name/' FS='.+,| ([(][^)]+[)] ?)*' OFS='|'
|ABC|9|WSOC|
|CBS|3|WBTV|
|NBC|36|WCNC|
|FOX|46|WJZY|
RARE Kpop Manifesto
  • Small question, What does this part do? `$-_=$--NF` – The fourth bird May 01 '23 at 17:42
  • @Thefourthbird : since i never defined `_`, it's the so called `awk null object` which is simultaneously a numeric `0` and an empty string `""`. `$-_` forces it to be numeric `-0` (negative zero), so that whole thing is for overwriting `$0` so it would print out the new column value, and is identical to `$+_=$--NF`. `TL;DR` it's a short hand way to write `{ print $--NF }`. The `"-"` is only needed because `nawk` gives a strange fatal error message while `mawk` refuses to print without it. both are out of compliance with `POSIX`. if you're on `gawk`, just `$_=$--NF` suffices. – RARE Kpop Manifesto May 02 '23 at 02:06
  • @Thefourthbird : doing a decrement to `NF` is safe here since the first part of the `&&` (logical `AND`) condition already guarantees `0 < NF` …. which reminded me i was being too verbose with my answer – RARE Kpop Manifesto May 02 '23 at 02:24
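
For readers who find `$-_=$--NF` hard to parse, here is a longhand sketch of what the comments above describe: with the same field separator, print the second-to-last field of every line containing tvg-name. The function name and the LC_ALL=C setting are assumptions added for illustration, not part of the original one-liner:

```shell
# Hypothetical helper: longhand form of  (NF*=/tvg-name/) && $-_=$--NF
# Usage: second_to_last_field Playlist.m3u   (reads stdin when no file is given)
second_to_last_field() {
  # LC_ALL=C is an assumption, so that the range [#-?] is read as plain ASCII
  # (it covers punctuation such as ( ) , . = and the digits, but not letters or spaces).
  LC_ALL=C awk -F'[#-?]+' '
    # Only EXTINF lines carry tvg-name; URL lines are skipped.
    /tvg-name/ && NF > 1 { print $(NF - 1) }
  ' "$@"
}
```

Like the original, this reads from the end of the line, so it relies on the call sign being the last parenthesised text in the title.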
0

Using GNU awk, if you want to get the last entry in the tvg-name field, you could use a negated character class that does not match a double quote, and then capture the value between the last pair of parentheses using a capture group and the match function.

The pattern matches:

  • \y A word boundary to prevent a partial word match
  • tvg-name=" Match literally
  • [^"]* Match optional characters other than "
  • \( Match (
  • (\w+) Capture 1+ word characters in group 1
  • \) Match )

For example

awk 'match($0, /\ytvg-name="[^"]*\((\w+)\)/, a) {print a[1]}' Playlist.m3u

Output

WSOC
WBTV
WCNC
WJZY

Another variation, matching any character between the parentheses using a negated character class:

awk 'match($0, /\ytvg-name="[^"]*\(([^()]+)\)/, a) {print a[1]}' Playlist.m3u
The fourth bird