55

In AWK, is it possible to specify "ranges" of fields?

Example. Given a tab-separated file "foo" with 100 fields per line, I want to print only the fields 32 to 57 for each line, and save the result in a file "bar". What I do now:

awk 'BEGIN{OFS="\t"}{print $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45, $46, $47, $48, $49, $50, $51, $52, $53, $54, $55, $56, $57}' foo > bar

The problem with this is that it is tedious to type and prone to errors.

Is there some syntactic form which allows me to say the same in a more concise and less error-prone fashion (like "$32..$57")?

9 Answers

37

Besides the awk answer by @Jerry, there are other alternatives:

Using cut (assumes tab delimiter by default):

cut -f32-57 foo >bar

Using perl:

perl -nle '@a=split /\t/;print join "\t", @a[31..56]' foo >bar
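As a quick sanity check of the range syntax, here is a minimal sketch on a made-up five-field line (the sample data is an assumption, not from the question). Both bounds of `-fM-N` are inclusive, so fields 32 to 57 would be written as `-f32-57`.

```shell
# Hypothetical sample: one line with five tab-separated fields.
# cut -fM-N keeps fields M through N inclusive; tab is the default delimiter.
printf 'f1\tf2\tf3\tf4\tf5\n' | cut -f2-4
```

This should print the second, third, and fourth fields, still separated by tabs.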
Community
j.w.r
28

Mildly revised version:

BEGIN { s = 32; e = 57; }

      { for (i=s; i<=e; i++) printf("%s%s", $(i), i<e ? OFS : "\n"); }
Jerry Coffin
  • You could get rid of the test in the `printf` by doing a `printf "%s", $s` before the loop, starting your loop at `s+1`, always use OFS as the prefix in the loop, and printing a `\n` after the loop. – jfg956 Nov 15 '12 at 13:21
  • But this solution breaks if you have 2 FS between your fields: it will replace it by a single FS. – jfg956 Nov 15 '12 at 13:21
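The variant suggested in the first comment might look roughly like this. It is only a sketch: the wrapper name `print_range` is made up for illustration, and setting both FS and OFS to a tab is an assumption based on the tab-separated file in the question.

```shell
# Hypothetical wrapper: print fields $1..$2 of tab-separated input.
print_range() {
    awk -v s="$1" -v e="$2" 'BEGIN { FS = OFS = "\t" }
    {
        printf "%s", $s                        # first field, no separator
        for (i = s + 1; i <= e; i++)
            printf "%s%s", OFS, $i             # separator goes in front
        printf "\n"                            # end the output line
    }'
}

# Usage on a made-up five-field line:
printf 'a\tb\tc\td\te\n' | print_range 2 4
```

Because the separator is always emitted as a prefix, the loop body needs no test for the last field.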
8

You can do it in awk by using RE intervals. For example, to print fields 3-6 of the records in this file:

$ cat file
1 2 3 4 5 6 7 8 9
a b c d e f g h i

would be:

$ gawk 'BEGIN{f="([^ ]+ )"} {print gensub("("f"{2})("f"{4}).*","\\3","")}' file
3 4 5 6
c d e f

I'm creating an RE segment f to represent every field plus its succeeding field separator (for convenience), then I'm using that in the gensub to delete 2 of those (i.e. the first 2 fields), remember the next 4 for reference later using \3, and then delete what comes after them. For your tab-separated file where you want to print fields 32-57 (i.e. the 26 fields after the first 31) you'd use:

gawk 'BEGIN{f="([^\t]+\t)"} {print gensub("("f"{31})("f"{26}).*","\\3","")}' file

The above uses GNU awk for its gensub() function. With other awks you'd use sub() or match() and substr().
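For awks without gensub(), one possible shape is sketched below. Note that it swaps in index() and substr() rather than match(), hard-codes a tab as the separator, and uses a made-up name `subflds_portable`; all three are assumptions for illustration. Like the gensub() version it only looks at $0 and never references a field.

```shell
# Hypothetical helper: print fields $1..$2 of tab-separated input
# using only index() and substr(), so no gensub() is required.
subflds_portable() {
    awk -v s="$1" -v e="$2" '
    {
        rest = $0; out = ""
        for (i = 1; i <= e; i++) {
            p = index(rest, "\t")                     # position of next tab, 0 if none
            fld = (p ? substr(rest, 1, p - 1) : rest) # text up to that tab
            if (i >= s) out = out (i > s ? "\t" : "") fld
            if (!p) break                             # ran out of fields
            rest = substr(rest, p + 1)                # continue after the tab
        }
        print out
    }'
}

# Usage on a made-up nine-field line:
printf '1\t2\t3\t4\t5\t6\t7\t8\t9\n' | subflds_portable 3 6
```

Unlike the regex version, this sketch also copes with empty fields and with a range that ends on the last field of the line.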

EDIT: Here's how to write a function to do the job:

gawk '
function subflds(s,e,   f) {
   f="([^" FS "]+" FS ")"
   return gensub( "(" f "{" s-1 "})(" f "{" e-s+1 "}).*","\\3","")
}
{ print subflds(3,6) }
' file
3 4 5 6
c d e f

Just set FS as appropriate. Note that this only works if your FS is a single character, and that it needs a tweak for the default FS if your input file can start with spaces and/or have multiple spaces between fields.

Ed Morton
  • 1
    It would be something absolutely nice to have in awk! – fred Sep 10 '19 at 15:10
  • @fred there's a million things that would be nice to have in awk but then that leads to a million additional language constructs which results in language bloat and a complete mess of hieroglyphics in every program. If anyone wants that, there's already a tool/language that provides exactly that - https://www.zoitz.com/archives/13. The awk language is based on the idea that there should only be language constructs to do things that are difficult to do with other language constructs - hence a tiny language you can do anything with that's easily readable. – Ed Morton Sep 10 '19 at 15:16
  • Old post, but would this be faster than just using loop? (long lines) – Jotne Nov 04 '19 at 13:03
  • 2
    @Jotne I expect so but I haven't tested for that. I say that because not only is it avoiding the iterations of a loop but by not mentioning any field in the script it's turning off field splitting and for each record ONLY doing `print gensub(,,s,,e,,)` instead of the equivalent of `split(,$0); for (i=s; i<=e; i++) printf "%s%s", $i, (i – Ed Morton Nov 04 '19 at 14:03
  • 1
    @EdMorton Thanks for your reply. I may test it out if I get time :) – Jotne Nov 05 '19 at 06:44
7

I'm late but this is quick and to the point so I'll leave it here. In cases like this I normally just remove the fields I don't need with gsub and print. Quick and dirty example: since you know your file is delimited by tabs, you can remove the first 31 fields (matching each field as a run of non-tab characters followed by a tab):

awk '{gsub(/^([^\t]*\t){31}/,"");print}' foo

example of removing 4 fields because lazy:

printf "a\tb\tc\td\te\tf\n" | awk '{gsub(/^([^\t]*\t){4}/,"");print}'

Output:

e   f

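This is shorter to write, easier to remember and uses fewer CPU cycles than horrendous loops. Note that it only strips the leading fields, so anything after the last field you want is still printed.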
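To cover both ends of the range with the same idea, something along these lines may work. It is a sketch under assumptions: the name `strip_to_range` is invented, the separator is taken to be a tab, and the leading fields are removed one sub() call at a time instead of with an interval expression, since not every awk supports `{n}` in a regex.

```shell
# Hypothetical helper: keep only fields $1..$2 of tab-separated input.
strip_to_range() {
    awk -v s="$1" -v e="$2" 'BEGIN { FS = OFS = "\t" }
    {
        for (i = 1; i < s; i++)
            sub(/^[^\t]*\t/, "")     # drop one leading field per pass
        n = e - s + 1                # number of fields to keep
        if (NF > n) NF = n           # drop everything after the range
        print
    }'
}

# Usage on a made-up six-field line:
printf 'a\tb\tc\td\te\tf\n' | strip_to_range 2 4
```

Assigning to NF rebuilds the record with OFS, which is why OFS is set to a tab as well.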

Unixtreme
2

You can use a combination of loops and printf for that in awk:

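#!/bin/bash

start_field=32
end_field=57

# Split and join on tabs; start and end are passed in from the shell.
awk -v start="$start_field" -v end="$end_field" 'BEGIN{FS=OFS="\t"}
{for (i=start; i<=end; i++) {
    printf "%s", $i;          # print the field itself
    if (i < end) {
        printf "%s", OFS;     # separator between fields
    } else {
        printf "\n";          # newline after the last field
    }
}}' foo > bar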

This looks a bit hacky, however:

  • it properly delimits your output based on the specified OFS, and
  • it makes sure to print a new line at the end for each input line in the file.
sampson-chen
  • Good points (+1) -- but I don't think it needs to get quite so long to accomplish those goals. – Jerry Coffin Nov 15 '12 at 04:29
  • I'm afraid this takes even longer to type than the original version, and it doesn't work as an awk one-liner, so one would need to create an intermediate file -> even more steps. If I'd go this route, I could as well write a Perl script. –  Nov 15 '12 at 05:33
  • @gojira Actually you very much can 1-line this, I just broke it down so you can see what's going on – sampson-chen Nov 15 '12 at 05:48
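For reference, a one-line form could look roughly like the following. The sample input and the 2..4 range are made up, and tab separators are assumed as in the question.

```shell
# One-line sketch: print fields s..e of each tab-separated line.
printf 'a\tb\tc\td\te\n' |
awk -v s=2 -v e=4 'BEGIN{FS=OFS="\t"} {for(i=s;i<=e;i++) printf "%s%s", $i, (i<e ? OFS : ORS)}'
```

For the original problem this would be called with `-v s=32 -v e=57` and `foo > bar` in place of the pipe.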
1

I do not know a way to do field range selection in awk. I know how to drop fields at the end of the input (see below), but not easily at the beginning. Below is the hard way to drop fields at the beginning.

If you know a character c that is not included in your input, you could use the following awk script:

BEGIN { s = 32; e = 57; c = "#"; }
{ NF = e            # Drop the fields after e.
  $s = c $s         # Put a c in front of the s field.
  sub(".*"c, "")    # Drop the chars before c.
  print             # Print the edited line.
}

EDIT:

And I just thought that you can always find a character that is not in the input: use \n.
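A sketch of the marker idea with the newline as the marker is shown below. Assumptions: RS is left at its default (a newline, which therefore cannot occur inside a record), the input is tab-separated, the name `range_by_marker` is invented, and index() plus substr() stand in for the sub() call so that no regex is built from the marker.

```shell
# Hypothetical helper: print fields $1..$2 of tab-separated input
# by planting a marker in front of field s and cutting at it.
range_by_marker() {
    awk -v s="$1" -v e="$2" 'BEGIN { FS = OFS = "\t" }
    {
        if (NF > e) NF = e                    # drop the fields after e
        $s = RS $s                            # put the marker before field s
        print substr($0, index($0, RS) + 1)   # keep what follows the marker
    }'
}

# Usage on a made-up five-field line:
printf 'a\tb\tc\td\te\n' | range_by_marker 2 4
```

Setting OFS to a tab keeps the separators intact when awk rebuilds the record after the assignments.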

jfg956
  • 1
    Use RS instead of "\n" if you want a character that's not in the input. – Ed Morton Nov 15 '12 at 16:09
  • FYI to delete the first n fields in tab (or any other single character)-separated input where n is a numeric variable would be `sub("([^" FS "]*" FS "){" n "}","")`. That has the advantage in this case of not replacing all of the tabs in the input with spaces as your posted solution would do unless you set `OFS="\t"`. You'd need to set FS to \t too of course. – Ed Morton Nov 15 '12 at 16:16
  • @EdMorton: as RS or FS can be more than a single character, I do not think using them in `sub` is the best general solution. – jfg956 Nov 15 '12 at 16:54
  • @EdMorton: you are also right about my solution combining FS. – jfg956 Nov 15 '12 at 16:55
  • RS can only be more than a single character in GNU awk and if you're doing that then you can't rely on "\n" not being part of the record so you'd need a different solution anyway. It's best to use RS as a starting assumption and then modify the script if it proves to be necessary. And yes, you can't use the sub() exactly like that if your FS is an RE which is why I said it only applies to single-char-separated fields. – Ed Morton Nov 15 '12 at 17:46
1

Unfortunately I don't seem to have access to my account anymore, and I don't have 50 rep to add a comment anyway.

Bob's answer can be simplified a lot using 'seq':

echo $(seq -s ,\$ 5 9| cut -d, -f2-)
$6,$7,$8,$9

The minor disadvantage is you have to specify your first field number as one lower. So to get fields 3 through 7, I specify 2 as the first argument.

seq -s ,\$ 2 7 sets the field separator for seq to ',$' and yields 2,$3,$4,$5,$6,$7

cut -d, -f2- sets the field delimiter to ',' and cuts off everything before the first comma by showing everything from the second field on, resulting in $3,$4,$5,$6,$7

When combined with Bob's answer, we get:

    $ cat awk.txt
    1 2 3 4 5 6 7 8 9
    a b c d e f g h i

    $ awk "{print $(seq -s ,\$ 2 7| cut -d, -f2-)}" awk.txt
    3 4 5 6 7
    c d e f g
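If the off-by-one start bothers you, a different pipeline can build the list from the real start and end. This is a sketch that swaps `seq -s` and `cut` for `seq`, `sed` and `paste`; the function name `field_list` is made up.

```shell
# Hypothetical helper: turn "3 7" into the string $3,$4,$5,$6,$7
field_list() {
    seq "$1" "$2" |      # one number per line
    sed 's/^/$/' |       # prefix each number with a literal $
    paste -sd, -         # join the lines with commas
}

# Usage on a made-up nine-field line (default whitespace separators):
printf '1 2 3 4 5 6 7 8 9\n' | awk "{print $(field_list 3 7)}"
```

The awk program is in double quotes so that the shell runs `field_list` first; the `$3,$4,...` text it returns is not expanded again by the shell.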
Danny
0

I use this simple function, which does not check that the field range exists in the line.

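function subby(f,l,   s,i) {   # s and i are local variables
  s = $f                       # start with the first field of the range
  for(i=f+1;i<=l;i++)
    s = sprintf("%s %s",s,$i)  # append each later field after a space

  return s
}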
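A minimal usage sketch, calling the function once per record. Two things here are choices made for this illustration rather than part of the answer: the loop counter is declared as an extra local parameter, and OFS is used as the joiner instead of a hard-coded space.

```shell
# Sketch: define the function and call it for every input line.
printf '1 2 3 4 5 6 7 8 9\na b c d e f g h i\n' | awk '
function subby(f, l,   s, i) {     # s and i are locals
    s = $f
    for (i = f + 1; i <= l; i++)
        s = s OFS $i               # join with the output field separator
    return s
}
{ print subby(3, 6) }'
```

With the default OFS the joiner is a single space, so the result matches the original function for whitespace-separated input.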
0

(I know OP requested "in AWK" but ... )

Using bash expansion on the command line to generate arguments list;

$ cat awk.txt
1 2 3 4 5 6 7 8 9
a b c d e f g h i

$ awk "{print $(c="" ;for i in {3..7}; do c=$c\$$i, ; done ; c=${c%%,} ; echo $c ;)}" awk.txt
3 4 5 6 7
c d e f g

Explanation:

c="" # var to hold args list
for i in {3..7} # the required variable range 3 - 7
do 
   # replace c's value with concatenation of existing value, literal $, i value and a comma
   c=$c\$$i, 
done 
c=${c%%,} # remove trailing/final comma
echo $c #return the list string

All of this is placed on a single line using semicolons, inside $() so it is evaluated/expanded in place.
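One limitation worth knowing: bash performs brace expansion before variable expansion, so `{3..7}` cannot take its bounds from variables. A sketch of the same list-building idea with a plain counting loop follows; the function name `build_fields` is made up.

```shell
# Hypothetical helper: build "$3,$4,...,$7" from two numeric arguments.
build_fields() {
    i=$1
    c=""                          # holds the argument list
    while [ "$i" -le "$2" ]; do
        c="$c\$$i,"               # append a literal $, the number, and a comma
        i=$((i + 1))
    done
    printf '%s\n' "${c%,}"        # remove the trailing comma
}

# Usage on a made-up nine-field line:
printf '1 2 3 4 5 6 7 8 9\n' | awk "{print $(build_fields 3 7)}"
```

Since the function runs inside `$()`, its `i` and `c` variables live in a subshell and do not leak into the calling shell.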

Bob