Given a text file named "people.txt" which consists of:

Anne
Bob
Carl
Daphne
Erwin
Gary
Heather

How can I use sed, or a similar one-liner, to specify a set of non-consecutive line numbers and filter the contents down to this result:

Bob
Erwin
Heather

(Note: ignore the fact that they are alphabetical)

Note that the real file I am using has over 100K lines, so the answer should consider efficiency.

I know I can use:

sed '5q;d' people.txt 

to get only line 5 ("Erwin"), but is there a variation on this argument in which I can specify a list of arbitrary line numbers?

I think this is possible with sed alone, but even after reading through man sed I am having trouble figuring this out. I've been looking at other answers that come very close to doing this but nearly all of them deal with getting either just a single line or contiguous lines (a range of lines), or that use a more complicated bash script; for instance, "Quick unix command to display specific lines in the middle of a file?" and "How can I print specific lines from a file in Unix?".

Adam Friedman

3 Answers

You can ask for specific lines by number like this:

sed -n '1p;5p;7p' my_file

The -n flag means, "don't print lines by default", and then for each line you want, you specify the line number and the p (print) command.
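
On the efficiency concern in the question: the command above still reads the file to the end. One possible refinement, offered as a sketch rather than as part of the original answer, is to quit as soon as the last wanted line has been printed. The example recreates the question's sample file in a temporary location and picks lines 2, 5 and 7 to match the desired output; the names `tmp` and `picked` are made up for illustration.

```shell
# Recreate the sample file from the question in a temporary location.
tmp=$(mktemp)
printf '%s\n' Anne Bob Carl Daphne Erwin Gary Heather > "$tmp"

# Print lines 2 and 5; at line 7, print and quit so nothing after it is read.
picked=$(sed -n '2p;5p;7{p;q;}' "$tmp")
printf '%s\n' "$picked"

rm -f "$tmp"
```

This prints Bob, Erwin and Heather. The early quit only helps when the last wanted line is well before the end of the file.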

larsks
  • Ah, that is probably what the OP wanted. Couldn't decipher it before I saw your answer :-) – Fredrik Pihl Feb 09 '15 at 19:57
  • Great, thank you! I figured it was something simple like that. I will test this out today, but perhaps you know off-hand: will this work well if I am picking out 50K lines from a 200K-line file? – Adam Friedman Feb 09 '15 at 20:35
  • Good answer - especially as it includes explanation of what's going on. Some `sed` features can be a little inscrutable :). – Sobrique Feb 09 '15 at 22:21
$ awk -v lines="2 4 7" 'index(" "lines" "," "NR" ")' file  
Bob
Daphne
Heather

$ awk -v lines="3 5" 'index(" "lines" "," "NR" ")' file  
Carl
Erwin

The blank chars around lines and NR in the above are necessary so that NR value 9 doesn't match when lines contains 19, for example.

If you don't mind hard-coding the line numbers inside the script you could alternatively do:

awk 'NR~/^(2|4|7)$/' file
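
For a long list of line numbers, a different technique that may scale better is to load the numbers into an awk array once and test membership with `in`, instead of scanning a string or a regular expression for every input line. This is a minimal sketch of that alternative, not the method used in the answer above; `nums`, `want`, `tmp` and `picked` are illustrative names, and the sample file is the one from the question.

```shell
# Recreate the sample file from the question in a temporary location.
tmp=$(mktemp)
printf '%s\n' Anne Bob Carl Daphne Erwin Gary Heather > "$tmp"

# Split the space-separated list once and store each number as an array key,
# then print a line only when its line number is one of those keys.
picked=$(awk -v lines="2 5 7" '
  BEGIN { n = split(lines, nums, " "); for (i = 1; i <= n; i++) want[nums[i]] }
  NR in want
' "$tmp")
printf '%s\n' "$picked"

rm -f "$tmp"
```

This prints Bob, Erwin and Heather. Because the numbers are compared as whole array keys, the padding with blanks that index() needs is not required here.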
Ed Morton
  • @AdamKatz - the edit you made yesterday to remove the blank chars around the index() args broke the script in the above answer and I had to change it back just now. Please don't edit my awk answers as I generally know what I'm doing. Please do feel free to leave me a comment if you think an answer I posted is wrong and I'll be happy to explain or fix it. – Ed Morton Feb 10 '15 at 13:37
  • interesting (and my apologies on that). It worked for me when I tested it. Can you explain why those quotes are necessary? All they do is concatenate spaces onto both ends of your lines. – Adam Katz Feb 10 '15 at 19:20
  • No problem. You always need blanks around NR to avoid false matches (NR=`9` falsely matches the 19 in lines=`3 19 7`, but NR=` 9 ` does not), and since you need blanks around NR, you need to make sure there's always a blank at the start and end of lines too (NR=` 3 ` undesirably does not match lines=`3 19 7`, but does match lines=` 3 19 7 `). Code that produces the expected output from a given input set is always trivial to write; it's code that won't produce unexpected output for other inputs that gets complicated. – Ed Morton Feb 10 '15 at 20:08
  • ah. Then why not do `awk -v lines=" 2 4 7 " 'index(lines, " " NR " ")' file`? Also, I'm still getting just 2,4,7 from `seq 1 999 |awk -v lines='2 4 7' 'index(lines,NR)'` using mawk 1.3.3 or gawk 4.1.1. – Adam Katz Feb 10 '15 at 20:28
  • It's better to not rely on the user of the script to pad with blanks when you can add them inside the script. In general `lines` will not be hard-coded; if it were, you wouldn't need it at all, since you could just hard-code the first arg to index(). The failure occurs when you have 2-digit (or more) numbers inside `lines` (e.g. 19), because then a single-digit NR (e.g. 1 or 9) will match on either of the digits. Try `awk -v lines="19" 'index(lines,NR)' file` on a file with 20 lines, then `awk -v lines="19" 'index(lines," "NR" ")' file` and then `awk -v lines="19" 'index(" "lines" "," "NR" ")' file`. – Ed Morton Feb 10 '15 at 21:07
  • Thanks for the explanation. I hadn't realized that `index()` was a substring function. I might suggest something more like `awk 'NR == 2 || NR == 4 || NR == 7' file` in that case, as it's more straightforward, but using `index()` is certainly quite clever. – Adam Katz Feb 10 '15 at 21:54
  • `awk 'NR == 2 || NR == 4 || NR == 7' file` would be fine to check for 3 hard-coded line numbers, but if that was what was wanted I'd use `NR~/^(2|4|7)$/` (which I just added to my answer). The former isn't a good general solution: it's around 90% redundant code, and it doesn't do what the OP requested, which was a `one-liner which specifies only a set of non-consecutive line numbers`. – Ed Morton Feb 10 '15 at 21:59
Dynamically generate the sed program.

Store the line numbers you want in an array:

$ lines=(2 5 7)
$ sed -n "$(printf "%dp;" "${lines[@]}")" file
Bob
Erwin
Heather

Or, if the line numbers are in a file (here named numbers, one per line):

$ sed -n "$(sed 's/$/p/' numbers)" file
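
To see what sed is actually handed in the two variants above, it can help to print the generated program on its own. The sketch below assumes the same numbers 2, 5 and 7; `from_args`, `from_file` and the temporary `numbers` path are illustrative names, and the numbers are passed to printf directly, which has the same effect as expanding the array.

```shell
# Array variant: printf repeats its format once per argument.
from_args=$(printf '%dp;' 2 5 7)
printf '%s\n' "$from_args"

# File variant: append "p" to every line of a list holding one number per line.
numbers=$(mktemp)
printf '%s\n' 2 5 7 > "$numbers"
from_file=$(sed 's/$/p/' "$numbers")
printf '%s\n' "$from_file"

rm -f "$numbers"
```

The first program is the single line `2p;5p;7p;`, and the second is `2p`, `5p` and `7p` on separate lines. sed accepts either form as its script.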
glenn jackman