1

I am trying to use MAWK, where the match() built-in function does not accept a third argument (an array that would receive the matched text):

match($1, /9f7fde/) {
  substr($1, RSTART, RLENGTH);
}

See doc.

How can I store this result in a variable named var so that I can use it later when I construct my output?

EDIT2 - Complete example:

Input file structure:

<iframe src="https://vimeo.com/191081157" frameborder="0" height="481" width="608" scrolling="no"></iframe>|Random title|Uploader|fun|tag1,tag2,tag3
<iframe src="https://vimeo.com/212192268" frameborder="0" height="481" width="608" scrolling="no"></iframe>|Random title|Uploader|fun|tag1,tag2,tag3

parser.awk:

{
  Embed = $1;
  Title = $2;
  User = $3;
  Categories = $4;
  Tags = $5;
}

BEGIN {
  FS="|";
}

# Regexp without pattern matching for testing purposes
match(Embed, /191081157/) {
  Id = substr(Embed, RSTART, RLENGTH);
}

{
  print Id"\t"Title"\t"User"\t"Categories"\t"Tags;
}

Expected output:

191081157|Random title|Uploader|fun|tag1,tag2,tag3

I want to call the Id variable outside the match() function.

MAWK version:

mawk 1.3.4 20160930
Copyright 2008-2015,2016, Thomas E. Dickey
Copyright 1991-1996,2014, Michael D. Brennan

random-funcs:       srandom/random
regex-funcs:        internal
compiled limits:
sprintf buffer      8192
maximum-integer     2147483647
Lanti
  • So, what's the expected output? btw, I couldn't find your example search string `9f7fde` in your "input file structure", – James Brown Nov 10 '16 at 23:15
  • The expected output is the only line that has `191081157` in the first column (`$1` or `Embed`). Developing a regexp pattern that returns only the string after `vimeo.com/` and before `"` is outside this question's scope. The example above does not work even with the hardcoded string. – Lanti Nov 10 '16 at 23:17
  • Expected output added. – Lanti Nov 10 '16 at 23:21
  • You are CALLING match(), not DEFINING match(). The code in the curly brackets is not inside the match() function, it's in an action block that's executed if the match() function returns true. When you do `match($1, /9f7fde/)` then `substr($1, RSTART, RLENGTH)` just contains the string `9f7fde` so that's pretty pointless. What exactly are you trying to do? – Ed Morton Nov 11 '16 at 02:01
  • That string is there to keep the example simple; later it will be a regex capture group that returns exactly this part of the URL on every line. I tried to reduce the code to the bare minimum problem, because someone's job here is to downvote every single question that's not strictly about the problem. I guess sharing only the problem (the part that I thought didn't work) also attracts close flags and downvotes. Either way, you always lose. – Lanti Nov 11 '16 at 09:39

2 Answers

1

The obvious answer would seem to be

match($1, /9f7fde/) { var = "9f7fde"; }

But more general would be:

match($1, /9f7fde/) { var = substr($1, RSTART, RLENGTH); }
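
A minimal sketch of how the variable stays usable after the guarded action, assuming `|`-separated input like the question's. The wrapper name `extract_var` and the two sample records are hypothetical and only there for illustration. It uses plain `awk` with POSIX features only, so it should behave the same under mawk.

```shell
# Hypothetical wrapper: prints "<matched text>|<second field>" for each record.
extract_var() {
  awk -F'|' '
    { var = "" }                         # reset so a stale value never leaks into the next record
    match($1, /9f7fde/) {                # this action runs only when match() returns nonzero
      var = substr($1, RSTART, RLENGTH)
    }
    { print var "|" $2 }                 # var is still visible here: awk variables are global
  '
}

printf '%s\n' 'xx9f7fdeyy|first' 'nothing|second' | extract_var
```

The reset in the first block is a design choice: without it, a record that fails to match would print the value left over from an earlier record.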
rici
  • The second option works if `print var;` is inside the `match()` function. But I want to use this variable outside `match()`. Is that possible with MAWK? (With GAWK's match() it is possible, but for speed I am trying to do it with MAWK.) – Lanti Nov 10 '16 at 22:20
  • @lanti: I have no idea what you mean by "inside" or "outside" match(). Match is a built-in function and everything inside it is part of the implementation. If you mean "inside the action guarded by the match() call," then the variable is certainly usable afterwards, since awk blocks are not scoped. If you mean something else please *edit your question* with a clear example. – rici Nov 10 '16 at 22:28
  • @Lanti: It works here with gawk, mawk, nawk and busybox awk. Please provide a [Minimal, Complete, Verifiable Example](http://stackoverflow.com/help/mcve) – Thor Nov 10 '16 at 22:47
  • Full example provided. – Lanti Nov 10 '16 at 22:56
  • @Lanti: No, you have not provided any sample input where your code fails – Thor Nov 10 '16 at 23:07
  • Sample CSV provided. – Lanti Nov 10 '16 at 23:14
  • Thank you! With a bare minimal example the code worked. It probably worked on my first try, but I got empty lines, so the problem is somewhere in the regex match. – Lanti Nov 10 '16 at 23:34
  • Probably what tricked me is that I am using a hardcoded regex string for the match (for testing purposes), like the numbers in my first code. This way I got empty strings for `var` in millions of lines... I know, it's late... – Lanti Nov 10 '16 at 23:41
  • It's completely pointless to do what you are doing with match() and substr(). – Ed Morton Nov 11 '16 at 02:03
0

UPDATE: The solution in the other answer can be simplified from

match($1, /9f7fde/) { var = substr($1, RSTART, RLENGTH) }

to

{ var = substr($1, match($1, /9f7fde/), RLENGTH) }

A failed match sets RSTART to 0 and RLENGTH to -1, so substr() returns an empty string.
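
The failed-match behaviour can be checked directly. This is a small sketch with a hypothetical wrapper name (`show_match_state`) and made-up input lines. It prints the return value of match(), RSTART, RLENGTH, and the extracted substring in brackets.

```shell
# Hypothetical wrapper: shows what match() leaves behind on a hit and on a miss.
show_match_state() {
  awk '{
    pos = match($0, /9f7fde/)            # 0 when nothing matches
    # substr() with a length of -1 yields the empty string
    print pos, RSTART, RLENGTH, "[" substr($0, pos, RLENGTH) "]"
  }'
}

printf '%s\n' 'ab9f7fde' 'zzz' | show_match_state
# prints:
# 3 3 6 [9f7fde]
# 0 0 -1 []
```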

But even the substr()/match() one-liner is too verbose: since the match criterion is a constant string, simply use

mawk '$1 ~ pat { var = pat }' pat='9f7fde'
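
Spelled out with readable names, the constant-string shortcut could look like the sketch below. The wrapper name `tag_const`, the variable names, and the sample records are assumptions for illustration. The shortcut holds only while the pattern contains no regex metacharacters, because only then is the matched text guaranteed to equal the pattern itself.

```shell
# Hypothetical wrapper: when the pattern is a plain string, the match is the pattern.
tag_const() {
  awk -F'|' -v pat='9f7fde' '
    { var = "" }                 # clear the value left by the previous record
    $1 ~ pat { var = pat }       # no substr() needed: the matched text equals pat
    { print var "|" $2 }
  '
}

printf '%s\n' 'xx9f7fdeyy|first' 'nothing|second' | tag_const
```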

============================================

Let's say the input is this line:

.....vimeo.com/191081157" frameborder="0" height="481" width="608" scrolling="no">Random title|Uploader|fun|tag1,tag2,tag3

{mawk/mawk2/gawk} 'BEGIN { OFS = "";

         FS = "(^.+vimeo[\056]com[\057]|[\042] frameborder.+[\057]iframe[>])" ;

     } (NF < 3) || ($2 !~ /191081157/) { next } { $1 = $1; print }'

\056 is the dot ( . ), \057 is the forward slash ( / ), and \042 is the straight double quote ( " ).

If a line can't match at all, move on to the next row. Otherwise, use the power of the field separator to gobble away all the unneeded parts of the line: the URL prefix and the trailing HTML attributes become separators. On the question's sample input that leaves an empty $1, the ID in $2, and the rest of the line in $3, which is why the guard is NF < 3.

The assignment $1 = $1 forces awk to rebuild $0 from the fields joined by the empty OFS, so the separator text disappears. Because $1 is empty here, the assignment evaluates to an empty string, which is false as a pattern, so the print is written out explicitly. This way, you don't need either match( ) or substr( ) at all.
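
A runnable sketch of the field-separator approach on the question's first sample line. The wrapper name `strip_embed` is hypothetical. Since the line begins with separator text, $1 is empty, so this version prints explicitly instead of relying on the truth value of `$1 = $1`.

```shell
# Hypothetical wrapper: strips the iframe markup and keeps the Vimeo ID plus the remaining columns.
strip_embed() {
  awk 'BEGIN {
         OFS = ""
         # \056 is ".", \057 is "/", \042 is a double quote
         FS = "(^.+vimeo[\056]com[\057]|[\042] frameborder.+[\057]iframe[>])"
       }
       NF < 3 || $2 !~ /191081157/ { next }   # skip lines that did not split as expected
       { $1 = $1; print }                      # rebuild $0 with the empty OFS, then print
  '
}

printf '%s\n' '<iframe src="https://vimeo.com/191081157" frameborder="0" height="481" width="608" scrolling="no"></iframe>|Random title|Uploader|fun|tag1,tag2,tag3' | strip_embed
# prints: 191081157|Random title|Uploader|fun|tag1,tag2,tag3
```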

RARE Kpop Manifesto