Using awk to pull specific lines from a file

Question

I have two files: one is my data, and the other is a list of line numbers that I want to extract from the data file. Can I use awk to read in the line-numbers file and then extract the lines of the data file that match those numbers?

Example: Data file:

This is the first line of my data
This is the second line of my data
This is the third line of my data
This is the fourth line of my data
This is the fifth line of my data

Line numbers file

1
4
5

Output:

This is the first line of my data
This is the fourth line of my data
This is the fifth line of my data

I've only ever used command line awk and sed for really simple stuff. This is way beyond me and I have been googling for an hour without an answer.

If you are dealing with huge files, as I have had to recently, it becomes necessary to avoid loading the numbers-file into memory. My solution was to sort the numbers-file and only deal with one line number at a time. See edit in my answer. — Thor, Apr 19 '13 at 10:58

score 11 · Answer 1 · answered Aug 29 '12 at 18:59

awk 'NR == FNR {nums[$1]; next} FNR in nums' numberfile datafile

Simply referring to an array subscript creates the entry. While awk reads the first file, NR (the overall record number) is equal to FNR (the record number within the current file), so the first block stores every line number as an array key and the next statement skips to the following record. Once awk reaches the second file, NR and FNR differ, and the line is printed whenever its FNR is present in the array (printing is the default action for a true pattern).
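
To see the idiom at work, here is a small sketch that recreates the sample files from the question and runs the same one-liner with comments added. The demo/ directory and the output file name are placeholders chosen for illustration, not part of the original answer.

```shell
# Hypothetical scratch directory holding the sample files from the question.
mkdir -p demo
printf 'This is the %s line of my data\n' first second third fourth fifth > demo/datafile
printf '%s\n' 1 4 5 > demo/numberfile

awk '
  # First file (NR == FNR): remember each wanted line number as an array key.
  NR == FNR { nums[$1]; next }
  # Second file: a bare pattern prints the record when FNR is a stored key.
  FNR in nums
' demo/numberfile demo/datafile > demo/extracted

cat demo/extracted
```

Because the numbers are stored as array keys, their order in the numbers file does not matter; lines come out in data-file order.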

Thor · Accepted Answer · 2015-11-13T11:50:04.210

10

One way with sed:

sed 's/$/p/' linesfile | sed -n -f - datafile

You can use the same trick with awk:

sed 's/^/NR==/' linesfile | awk -f - datafile
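
Both pipelines work by turning the list of numbers into a small program. One way to see this is to write the generated script to a file instead of piping it on. The sketch below assumes a linesfile containing 1, 4 and 5 as in the question, kept in a hypothetical demo/ directory.

```shell
mkdir -p demo
printf '%s\n' 1 4 5 > demo/linesfile

# Generated sed script: one "print this line" command per number.
sed 's/$/p/' demo/linesfile > demo/script.sed
cat demo/script.sed      # 1p, 4p, 5p on separate lines

# Generated awk script: one pattern per number; a pattern with no action prints.
sed 's/^/NR==/' demo/linesfile > demo/script.awk
cat demo/script.awk      # NR==1, NR==4, NR==5 on separate lines
```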

Edit - Huge files alternative

With a huge number of lines, it is not prudent to keep whole files in memory. One solution in that case is to sort the numbers-file and read it one line at a time. The following has been tested with GNU awk:

extract.awk

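BEGIN {
  # Read the first wanted line number. getline returns 1 on success,
  # 0 at end of file and -1 on error (GNU awk then sets ERRNO).
  if ((getline n < linesfile) <= 0) {
    if (length(ERRNO))
      print "Unable to read linesfile '" linesfile "': " ERRNO > "/dev/stderr"
    exit
  }
}

NR == n {
  print
  # Fetch the next wanted line number; stop at end of file or on error.
  if ((getline n < linesfile) <= 0) {
    if (length(ERRNO))
      print "Unable to read linesfile '" linesfile "': " ERRNO > "/dev/stderr"
    exit
  }
}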

Run it like this:

awk -v linesfile="$linesfile" -f extract.awk infile
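
The script only ever looks at the next pending number, so the numbers-file has to be in ascending order, and a repeated number would stall it. A minimal sketch of preparing such a file with sort (the file names under demo/ are placeholders):

```shell
mkdir -p demo
printf '%s\n' 13 2 8 4 8 > demo/linesfile.unsorted

# -n sorts numerically (so 13 comes after 8), -u drops repeated numbers.
sort -n -u demo/linesfile.unsorted > demo/linesfile.sorted
cat demo/linesfile.sorted
```

The sorted file can then be passed to the script with -v linesfile=demo/linesfile.sorted.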

Testing:

echo "2
4
7
8
10
13" | awk -v linesfile=/dev/stdin -f extract.awk <(paste <(seq 50e3) <(seq 50e3 | tac))

Output:

2	49999
4	49997
7	49994
8	49993
10	49991
13	49988

edited Nov 13 '15 at 11:50

answered Aug 29 '12 at 17:06

Thor



Brilliant. One quick question though: the ` - datafile`. I've never seen a dash just floating on its own on the command line. What does it do? – Davy Kavanagh Aug 29 '12 at 17:20

The `-f` switch takes the name of a `sed/awk` script; when the argument to `-f` is a hyphen (`-`), it means that `sed/awk` should read the script from `stdin`. – Thor Aug 29 '12 at 17:24

It means that /dev/stdin is used as the file... in this case, awk source code. – kbulgrien Aug 29 '12 at 17:25
Thanks. I twigged that after about a minute looking at it, but it's nice to know I am not totally brain dead. Cheers! – Davy Kavanagh Aug 29 '12 at 17:32

score 1 · Answer 3 · answered Aug 29 '12 at 17:20

Here is an awk example. inputfile is loaded up front, then matching records of datafile are output.

awk \
  -v RS="[\r]*[\n]" \
  -v FILE="inputfile" \
  'BEGIN \
   {
     LINES = ","
     while ((getline Line < FILE) > 0)
     {
       LINES = LINES Line ","
     }
   }
   LINES ~ "," NR "," \
   {
     print
   }
  ' datafile

score 1 · Answer 4 · answered Jul 05 '14 at 17:31

I had the same problem. This is the solution already posted by Thor:

cat datafile \
| awk 'BEGIN{getline n<"numbers"} n==NR{print; getline n<"numbers"}'

If, like me, you don't have a numbers file because the numbers arrive on stdin, and you don't want to generate a temporary numbers file, then this is an alternative solution (the numbers must be in ascending order, since datafile is read only once, front to back):

cat numbers \
| awk '{while((getline line<"datafile")>0) {n++; if(n==$0) {print line;next}}}'
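
As a sketch of the stdin variant, the numbers can come straight from another command. The example below assumes the five-line sample from the question saved under a hypothetical demo/ path, with printf standing in for whatever command produces the numbers.

```shell
mkdir -p demo
printf 'This is the %s line of my data\n' first second third fourth fifth > demo/datafile

# Numbers arrive on stdin; for each one, keep reading demo/datafile
# until the line with that number is reached, then print it.
printf '%s\n' 1 4 5 |
  awk '{ while ((getline line < "demo/datafile") > 0) { n++; if (n == $0) { print line; next } } }' > demo/from_stdin

cat demo/from_stdin
```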

score 0 · Answer 5 · edited Aug 07 '21 at 03:12

while read -r line; do sed -n "${line}p" Datafile.txt; done < numbersfile.txt

edited Aug 07 '21 at 03:12

Adam Smooch


answered Jun 12 '14 at 12:00

TestBud


You can redirect the output to another file by editing the above syntax as follows: while read -r line; do echo $(sed -n "${line}p" Datafile.txt >> output.txt); done < numbersfile.txt – TestBud Jun 12 '14 at 12:01
Sorry, disregard the above command line for redirecting output; use this instead: while read -r line; do sed -n "${line}p" Datafile.txt >> output.txt; done < numbersfile.txt – TestBud Jun 12 '14 at 12:08

score 0 · Answer 6 · edited Apr 03 '19 at 20:00

This solution...

awk 'NR == FNR {nums[$1]; next} FNR in nums' numberfile datafile

...prints each matching line only once, even when its number appears several times in the numberfile. What if the numberfile contains repeated entries and you want the line printed once per entry? Then sed is a better (but much slower) alternative:

sed -nf <(sed 's/.*/&p/' numberfile) datafile
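
A small sketch of the difference, using a numbers file that lists line 4 twice. The file names under demo/ are placeholders, and the process substitution is replaced here with a temporary script file so the example does not depend on bash.

```shell
mkdir -p demo
printf 'This is the %s line of my data\n' first second third fourth fifth > demo/datafile
printf '%s\n' 1 4 4 > demo/numberfile

# Array keys are unique, so awk prints line 4 once.
awk 'NR == FNR { nums[$1]; next } FNR in nums' demo/numberfile demo/datafile > demo/awk_out

# The generated sed script contains "4p" twice, so line 4 is printed twice.
sed 's/.*/&p/' demo/numberfile > demo/dups.sed
sed -n -f demo/dups.sed demo/datafile > demo/sed_out

wc -l < demo/awk_out   # 2 lines
wc -l < demo/sed_out   # 3 lines
```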

edited Apr 03 '19 at 20:00

jkdev


answered Apr 03 '19 at 16:00

dce

