Better regular expression to get a value in parenthesis

Question

I have a M3U playlist that looks something like this:

#EXTM3U
#EXTINF:-1 tvg-id="wsoc.us" tvg-name="ABC 9 (Something) (WSOC)" tvg-logo="" group-title="US Locals",ABC 9 (Something) WSOC (WSOC) 
http://some.url/1
#EXTINF:-1 tvg-id="wbtv.us" tvg-name="CBS 3 WBTV (WBTV)" tvg-logo="" group-title="US Locals",CBS 3 WBTV (WBTV)
http://some.url/2
#EXTINF:-1 tvg-id="wcnc.us" tvg-name="NBC (Hey) 36 WCNC (WCNC)" tvg-logo="" group-title="US Locals (Something here)",NBC 36 (Hey) WCNC (WCNC)
http://some.url/3
#EXTINF:-1 tvg-id="wjzy.us" tvg-name="FOX 46 WJZY (Shout Out) (WJZY)" tvg-logo="" group-title="US Locals",FOX 46 WJZY (Shout Out) (WJZY)
http://some.url/4

I'm looking to get the last entry in the tvg-name field without the parentheses, for example WSOC, WBTV, WCNC, etc.

This works:

grep -Po 'tvg-name=\".*?\"'  Playlist.m3u | awk -F'(' '{print $NF}' | cut -f1 -d")" | sort -u

But I know there has got to be a better way than using grep, awk, and cut. It's been driving me nuts.

your code includes a `sort -u` on the end; can the input file contain duplicate `tvg-name` fields? if the answer is 'yes', please update the question a) to include an input with a duplicate `tvg-name`, b) to show the expected output and c) to explicitly mention that you want a unique list; also ... do you need the final output to be sorted? — markp-fuso, Apr 20 '23 at 20:44
Having exactly the same target string appear twice in each line (e.g. on line 2 `WSOC` appears in `tvg-name="ABC 9 (Something) (WSOC)"` and also in `group-title="US Locals",ABC 9 (Something) WSOC (WSOC)`) makes this a bad test case as a script that doesn't do what you want could produce the output you want. — Ed Morton, Apr 21 '23 at 00:39

Barmar · Answer 1 · 2023-04-20T20:20:01.673

You can make both ( and ) field separators, so you don't need the last cut.

You don't need to escape double quotes inside single-quoted strings.

grep -Po 'tvg-name=".*?"'  Playlist.m3u | awk -F'[()]' '{print $(NF-1)}'

If you're using GNU awk you can also use a capture group to get the tvg-name=".*" part, so you don't need grep.

awk 'match($0, /tvg-name="[^"]*\(([^)]*)/, m) { print m[1] }' Playlist.m3u

See AWK: Access captured group from line pattern

score 4 · Answer 2 · answered Apr 20 '23 at 20:26

With GNU awk. Use ( and ) as field separators and, for every row that contains tvg-name, print the second-to-last field ($(NF-1)).

awk -F '[()]' '/tvg-name/{print $(NF-1)}' Playlist.m3u

Output:

WSOC
WBTV
WCNC
WJZY

Cyrus

FYI that'd behave the same with any awk, it's not GNU-specific. It's printing the duplicate string from the end of the line though, not the last string in the tvg-name field. idk if that matters to the OP or not. – Ed Morton Apr 21 '23 at 00:31
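The difference described in this comment only shows up when the last parenthesized value on the line differs from the last one inside tvg-name. A minimal sketch, assuming a made-up line whose title ends in (OTHER); the sample line and the variable names are hypothetical, not from the question:

```shell
# Hypothetical sample: tvg-name ends in (WSOC), but the title after the
# comma ends in (OTHER), so the two readings of "last" disagree.
line='#EXTINF:-1 tvg-id="wsoc.us" tvg-name="ABC 9 (Something) (WSOC)" tvg-logo="" group-title="US Locals",ABC 9 (OTHER)'

# Second-to-last field of the whole line: whatever sits in the final (...) pair.
from_line_end=$(printf '%s\n' "$line" | awk -F'[()]' '/tvg-name/ { print $(NF-1) }')

# Restricted to the quoted tvg-name value: [^"]* cannot run past the closing quote.
from_tvg_name=$(printf '%s\n' "$line" | sed -n 's/.*tvg-name="[^"]*(\([^)]*\).*/\1/p')

printf 'line end: %s\ntvg-name: %s\n' "$from_line_end" "$from_tvg_name"
```

On the question's own sample both approaches print the same call signs, because each line repeats the value at the end.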

Gilles Quénot · Accepted Answer · 2023-04-20T21:43:02.190

Using just a regex with `GNU` `grep`:

grep -oP 'tvg-name.*\(\K\w+(?=\))' /tmp/file.m3u

The regular expression matches as follows:

Node	Explanation
`tvg-name`	'tvg-name'
`.*`	any character except \n (0 or more times, matching as many as possible)
`\(`	(
`\K`	resets the start of the match (what is `K`ept), a shorter alternative to a look-behind assertion; see: look arounds and Support of K in regex
`\w+`	word characters (a-z, A-Z, 0-9, _) (1 or more times, matching as many as possible)
`(?=`	look ahead to see if there is:
`\)`	)
`)`	end of look-ahead

Or using a proper m3u parser:

First install the CPAN module:

cpan Parse::M3U::Extended

Then run:

#!/usr/bin/env perl

use strict; use warnings;

use Parse::M3U::Extended qw(m3u_parser);
use File::Slurp;
use feature 'say';
my $m3u = read_file('/tmp/file.m3u');
my @items = m3u_parser($m3u);

foreach my $item (@items) {
    if ($item->{type} eq "directive" and $item->{tag} eq "EXTINF") {
        $_ = $item->{value};
        s/.*\((\w+)\)/$1/;
        say;
    }
}

This has the advantage of being reusable for other use cases in a reliable way, which is not the case with ad hoc awk, sed, etc.

Output:

WSOC 
WBTV
WCNC
WJZY

The single grep line is exactly what I was looking for. For this particular use case I don't need an M3U parser. Thanks so much! — Dan Marcoux, Apr 20 '23 at 21:05
@GillesQuénot : `m3u8` is such a straight forward format I'm surprised people wrote modules to """"parse"""" it — RARE Kpop Manifesto, Apr 21 '23 at 11:43

score 3 · Answer 4 · answered Apr 22 '23 at 04:00

With your shown samples, please try the following awk code; it should work in any POSIX awk. It uses the match function together with substr: the regex tvg-name=".*\([^)]* matches the value, and only the required part of the match is printed.

awk '
match($0,/ tvg-name=".*\([^)]*/){
  val=substr($0,RSTART,RLENGTH)
  sub(/.*\(/,"",val)
  print val
}
'  Input_file
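Because `.*` is greedy, the regex in this answer can run past the closing quote of tvg-name when later text on the line also contains parentheses. A minimal sketch of a quote-bounded variant that keeps the same POSIX match/RSTART/RLENGTH approach; the sample line ending in (OTHER) is hypothetical, not from the question:

```shell
# Hypothetical sample: the title after the comma ends in (OTHER), not (WSOC).
line='#EXTINF:-1 tvg-id="wsoc.us" tvg-name="ABC 9 (Something) (WSOC)" tvg-logo="" group-title="US Locals",ABC 9 (OTHER)'

bounded=$(printf '%s\n' "$line" | awk '
match($0, /tvg-name="[^"]*\(/) {        # ends at the last "(" before the closing quote
  val = substr($0, RSTART + RLENGTH)    # text right after that "("
  sub(/\).*/, "", val)                  # drop the first ")" and everything after it
  print val
}')

printf '%s\n' "$bounded"
```

Here `[^"]*` plays the role of `.*` but cannot cross a double quote, so the match stays inside the tvg-name value.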

score 2 · Answer 5 · answered Apr 21 '23 at 00:37

Using any sed in any shell on every Unix box:

$ sed -n 's/.*tvg-name="[^"]*(\([^)]*\).*/\1/p' file
WSOC
WBTV
WCNC
WJZY
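The question's original pipeline ended in `sort -u`. If a unique, sorted list is wanted, the same sed expression can be piped into it. A minimal sketch, assuming a temporary file with a made-up duplicate entry (the repeated wsoc line is not in the question's sample):

```shell
# Temporary sample file; the repeated wsoc entry is a made-up duplicate.
playlist=$(mktemp)
cat > "$playlist" <<'EOF'
#EXTM3U
#EXTINF:-1 tvg-id="wsoc.us" tvg-name="ABC 9 (Something) (WSOC)" tvg-logo="" group-title="US Locals",ABC 9 (Something) WSOC (WSOC)
http://some.url/1
#EXTINF:-1 tvg-id="wbtv.us" tvg-name="CBS 3 WBTV (WBTV)" tvg-logo="" group-title="US Locals",CBS 3 WBTV (WBTV)
http://some.url/2
#EXTINF:-1 tvg-id="wsoc.us" tvg-name="ABC 9 (Something) (WSOC)" tvg-logo="" group-title="US Locals",ABC 9 (Something) WSOC (WSOC)
http://some.url/1
EOF

# Extract the last parenthesized value inside tvg-name, then de-duplicate and sort.
unique_callsigns=$(sed -n 's/.*tvg-name="[^"]*(\([^)]*\).*/\1/p' "$playlist" | sort -u)
printf '%s\n' "$unique_callsigns"

rm -f "$playlist"
```

Note that `sort -u` reorders the output, so drop it if playlist order matters.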

Ed Morton

RARE Kpop Manifesto · Answer 6 · 2023-05-02T02:45:02.177

you can do it the really complex perl or sed ways, with capturing groups and look-aheads and boundary assertions and what not,

or you can do it the awk way :

mawk 'NF *= /tvg-name/' FS='.+[(]|[)][^(]*$' OFS=

gawk '$_ = $(NF -= NF^!/tvg-name/)' FS='[#-?]+'

WSOC
WBTV
WCNC
WJZY

if you're ABSOLUTELY sure nothing is beyond that final ), then even cleaner :

mawk 'NF *= /tvg-name/' FS='.+[(]|.$' OFS=

if you modify the regex by just a tiny bit, you can even get the broadcast networks ...

awk '(NF*=/tvg-name/) && $-_=$--NF' FS='[#-?]+'

WSOC
WBTV
WCNC
WJZY

... simply by changing that (#) to a (+) ...

awk '(NF*=/tvg-name/) && $-_=$--NF' FS='[+-?]+'  # old solution

awk 'NF*=/tvg-name/' FS='.+,| .+$' OFS=          # more succinct

ABC 
CBS 
NBC 
FOX

or change it slightly for easy capture of components without needing a dedicated M3U parser or using capture groups :

awk 'NF*=/tvg-name/' FS='.+,| ([(][^)]+[)] ?)*' OFS='|'

|ABC|9|WSOC|
|CBS|3|WBTV|
|NBC|36|WCNC|
|FOX|46|WJZY|

edited May 02 '23 at 02:45

answered Apr 21 '23 at 11:22

Small question, What does this part do? `$-_=$--NF` – The fourth bird May 01 '23 at 17:42
@Thefourthbird : since i never defined `_`, it's the so called `awk null object` which is simultaneously a numeric `0` and an empty string `""`. `$-_` forces it to be numeric `-0` (negative zero), so that whole thing is for overwriting `$0` so it would print out the new column value, and is identical to `$+_=$--NF`. `TL;DR` it's a short hand way to write `{ print $--NF }`. The `"-"` is only needed because `nawk` gives a strange fatal error message while `mawk` refuses to print without it. both are out of compliance with `POSIX`. if you're on `gawk`, just `$_=$--NF` suffices. – RARE Kpop Manifesto May 02 '23 at 02:06
@Thefourthbird : doing a decrement to `NF` is safe here since the first part of the `&&` (logical `AND`) condition already guarantees `0 < NF` …. which reminded me i was being too verbose with my answer – RARE Kpop Manifesto May 02 '23 at 02:24
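For readers who prefer the long form, here is a sketch of what the comments describe, assuming the stated equivalence with `{ print $--NF }` holds: with FS='[#-?]+' the field just before the last one is the call sign, so printing `$(NF-1)` on tvg-name lines should give the same result. The temporary file and the `LC_ALL=C` setting are additions for this sketch, not part of the answer:

```shell
# Temporary copy of the question's sample playlist.
playlist=$(mktemp)
cat > "$playlist" <<'EOF'
#EXTM3U
#EXTINF:-1 tvg-id="wsoc.us" tvg-name="ABC 9 (Something) (WSOC)" tvg-logo="" group-title="US Locals",ABC 9 (Something) WSOC (WSOC)
http://some.url/1
#EXTINF:-1 tvg-id="wbtv.us" tvg-name="CBS 3 WBTV (WBTV)" tvg-logo="" group-title="US Locals",CBS 3 WBTV (WBTV)
http://some.url/2
#EXTINF:-1 tvg-id="wcnc.us" tvg-name="NBC (Hey) 36 WCNC (WCNC)" tvg-logo="" group-title="US Locals (Something here)",NBC 36 (Hey) WCNC (WCNC)
http://some.url/3
#EXTINF:-1 tvg-id="wjzy.us" tvg-name="FOX 46 WJZY (Shout Out) (WJZY)" tvg-logo="" group-title="US Locals",FOX 46 WJZY (Shout Out) (WJZY)
http://some.url/4
EOF

# [#-?] spans the ASCII range from "#" to "?", which covers ( ) , - . / digits : ; < = > ?
# but not spaces, double quotes, or letters. LC_ALL=C keeps that range byte-based.
# Each line ends in ")" (optionally followed by a space), so the call sign is field NF-1.
callsigns=$(LC_ALL=C awk -F'[#-?]+' '/tvg-name/ { print $(NF-1) }' "$playlist")
printf '%s\n' "$callsigns"

rm -f "$playlist"
```

Unlike the golfed version, this leaves NF and $0 untouched and relies only on a pattern-action pair.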

score 0 · Answer 7 · answered May 01 '23 at 17:42

Using GNU awk, if you want to get the last entry in the tvg-name field, you could use a negated character class that does not match a double quote, then capture the value inside the last pair of parentheses using a capture group and the match function.

The pattern matches:

\y A word boundary to prevent a partial word match
tvg-name=" Match literally
[^"]* Match optional characters other than "
\( Match (
(\w+) Capture 1+ word characters in group 1
\) Match )

For example

awk 'match($0, /\ytvg-name="[^"]*\((\w+)\)/, a) {print a[1]}' Playlist.m3u

Output

WSOC
WBTV
WCNC
WJZY

Another variation, matching any character between the parentheses using a negated character class:

awk 'match($0, /\ytvg-name="[^"]*\(([^()]+)\)/, a) {print a[1]}' Playlist.m3u
