1

I've got some sample data in the following form and need to extract the email address from it:

from=<user@mail.com> (<-- note that this corresponds to $7)
...
...

Currently I'm using this:

awk '/from=<.*>/ {print $7}' mail.log

However, the regex only selects which lines get printed.

The output is still the whole field, from=<user@mail.com> (as in the first text box), rather than just the email address.

fedorqui
Noob
  • possible duplicate of [awk: access captured group from line pattern](http://stackoverflow.com/questions/2957684/awk-access-captured-group-from-line-pattern) –  Mar 03 '15 at 11:15
  • Can there be multiple strings enclosed in `<` and `>` per line? – anubhava Mar 03 '15 at 11:25

4 Answers

4

You can use gsub to remove everything around < and >:

awk '{gsub(/(^[^<]*<|>.*$)/, "", $7)}1' file

The key point here is (^[^<]*<|>.*$), a regex made of two alternative blocks, (A|B):

  • ^[^<]*< matches everything from the beginning of the field up to and including the first <.
  • >.*$ matches everything from > up to the end of the field.
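As a minimal sketch of how the two blocks work, the same stripping can be done with two separate sub() calls, one per block. The function name strip_brackets and the sample input are made up for illustration. Unlike the one-liner above, this version prints only the address, and only for lines whose 7th field contains <...>:

```shell
# Hypothetical helper: prints the text between < and > in field 7.
strip_brackets() {
  awk '$7 ~ /<.*>/ {
    addr = $7
    sub(/^[^<]*</, "", addr)   # block A: drop everything up to and including the first "<"
    sub(/>.*$/, "", addr)      # block B: drop the first ">" and everything after it
    print addr
  }' "$@"
}

# Made-up sample input in the same shape as the question
printf '%s\n' '1 2 3 4 5 6 from=<user@mail.com> 8' '1 2 3 4 5 6 <user@mail.com> 8' | strip_brackets
```

Working on a copy (addr) leaves $7 and $0 untouched, so the rest of the record is still available if it is needed later in the script.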

Test

$ cat a
1 2 3 4 5 6 from=<user@mail.com> 8
1 2 3 4 5 6 <user@mail.com> 8
$ awk '{gsub(/(^[^<]*<|>.*$)/, "", $7)}1' a
1 2 3 4 5 6 user@mail.com 8
1 2 3 4 5 6 user@mail.com 8
fedorqui
1

Warning: this relies on gensub(), which is specific to GNU awk (gawk); other awk implementations, including mawk and the default awk on many non-Linux systems, don't provide it:

awk '/from=<([^>]*)>/ { print gensub(/.*from=<([^>]*)>.*/, "\\1", "1");}' mail.log

The core of this is the gensub() function. Given a regex, it performs a substitution (by default on the whole line, $0) and returns the modified string instead of changing the target in place. The replacement here is "\\1", which refers to the first capture group, i.e. the part matched by ([^>]*). So the regex matches the whole line, with the address captured in the middle, and the call returns just the captured part.

piojo
  • you should state that it's gawk-specific. (MAYBE mawk too). – Ed Morton Mar 03 '15 at 12:14
  • @EdMorton: Nope, mawk doesn't have `gensub()` (checked in v1.3.4). – mklement0 Mar 03 '15 at 14:10
  • Anybody know if there's another way to get capture groups? This seems like such an important feature, I'm surprised if it's so new and nonstandard. – piojo Mar 03 '15 at 14:12
  • You can take a look at [awk: access captured group from line pattern](http://stackoverflow.com/a/4673336/1983854) – fedorqui Mar 03 '15 at 14:15
  • @fedorqui Huh, that's pretty unambiguous. No capture groups at all. – piojo Mar 03 '15 at 14:33
  • wrt capture groups - not without GNU awk. In GNU awk you can use gensub() or match(a,b,grps). In all other awks you need to write something like `re[1]="foo"; re[2]="bar"; for (i=1;i in re; i++) { match($0,re[i]); grps[i] = substr($0,RSTART,RLENGTH) }` to get a simulation of "capture groups". It was a bad mistake to not introduce capture groups in the original awk in the 1970s (since sed already had that functionality) but now sub()/gsub() need to be backward compatible with that old functionality which is why gawk introduced gensub(). – Ed Morton Mar 03 '15 at 15:02
1

GNU grep can handle this nicely if you use a positive lookbehind (which requires -P for Perl-compatible regexes):

$ grep -Po '(?<=from=<)[^>]*' file
user@mail.com

This will print anything between from=< and > in file.

Chris Seymour
  • The OP isn't trying to pull out the 7th field from many possible matches in the record, the fact this field matches to `$7` is coincidental. – Chris Seymour Mar 03 '15 at 11:36
  • I understood you exactly. This was my intention entirely. The OP wants to print text between `<` and `>` and this does exactly that without limiting the field position. – Chris Seymour Mar 03 '15 at 11:57
1

iiSeymour's answer is the simplest approach in this case, if you have GNU grep (as he states).
You could even simplify it a little with \K (which drops everything matched up to that point): grep -Po 'from=<\K[^>]*' file.

For those NOT using GNU grep (implementations without -P for PCRE (Perl-Compatible Regular Expression) support), you can use the following pipeline, which is not the most efficient, but easy to understand:

grep -o 'from=<[^>]*' file | cut -d\< -f2
  • -o causes grep to output only the matched part of the input, which includes from=< in this case.
  • The cut command then prints the substring after the <, i.e. the second field (-f2) based on the delimiter < (-d\<), effectively printing only the email address.
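Another option that needs neither GNU grep nor GNU awk is sed, which supports capture groups in its s command. This is a different technique from the pipeline above and is shown only as a sketch; the function name sed_from_addr is hypothetical:

```shell
# Hypothetical helper: prints the address captured between from=< and >.
sed_from_addr() {
  # -n suppresses the default output; the p flag prints only lines where the substitution happened.
  # \( \) captures the address and \1 makes it the entire replacement for the line.
  sed -n 's/.*from=<\([^>]*\)>.*/\1/p' "$@"
}

# Made-up sample line in the same shape as the question
printf '%s\n' '1 2 3 4 5 6 from=<user@mail.com> 8' | sed_from_addr
```

Because the leading .* is greedy, a line containing several from=<...> occurrences would yield the last one.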
Community
mklement0