
I want to select a large number of specific rows from large files.

Using Perl, I am creating a command of this style (here printing the 2nd and 4th rows):

sed -n -e 2p -e 4p $file

And launching it using a system() command.
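For illustration, the Perl side looks roughly like this (a simplified sketch; the file name and the way the line numbers are computed are just placeholders):

#!/usr/bin/perl
use strict;
use warnings;

my $file = "data.txt";   # hypothetical input file
my @rows = (2, 4);       # in reality, thousands of line numbers

# build one -e expression per wanted row: sed -n -e 2p -e 4p data.txt
my $cmd = "sed -n " . join(" ", map { "-e ${_}p" } @rows) . " $file";
system($cmd) == 0 or die "sed failed: $?";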

This works fine, except when the number of rows to select from a file becomes quite large. It currently works with ~10,000 rows, but not with another file where I want to select ~17,000 rows. Is there a limit on the number of arguments that can be passed to sed? Is there an alternative UNIX tool to use? Thanks for your help

jul635

4 Answers


You must surely have the list of lines that you want somewhere in a file, so let's assume that file is called lines.txt and looks like this:

1
2
4
7

Now, you can do this:

awk 'FNR==NR{wanted[$0]++;next} FNR in wanted' lines.txt file

That works as follows: FNR==NR is only true while awk is reading the first file, lines.txt, so the first set of curly braces stores each wanted line number in the array wanted[] and then moves to the next line. The second part, FNR in wanted, applies to the processing of your second file, called file: if the current line number within that file (FNR) is a key in wanted[], the line is printed (printing is awk's default action).
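For example, with lines.txt as above and a hypothetical 10-line input file:

seq 10 | sed 's/^/line /' > file
awk 'FNR==NR{wanted[$0]++;next} FNR in wanted' lines.txt file

which prints:

line 1
line 2
line 4
line 7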

Mark Setchell

It's not a sed limit, but an operating-system limit on the command-line length: at most getconf ARG_MAX bytes (on Linux I have seen values for it ranging from 131072 to 2621440).
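You can check the limit on your own machine (the exact value varies by system):

getconf ARG_MAX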

The comment from René Nyffenegger is wise: Perl is the better option for this class of problems in the *NIX world...

If you describe how you need to select the rows to be extracted (e.g. from line i to line j? a list of specific rows? some other logic?), it should be easy to give you a code sample...

UPDATE: Below is an example for the first use case (a range of lines), followed by a sketch for the second, more generic use case (a list of specific rows). Of course, if you post an example of your actual use case and a pattern can be found, the solution can probably be simplified further...

#!/usr/bin/perl
#
# Print a range of lines from a text file.
# Usage: extract-a-range-of-lines.pl first-line last-line input-file

use strict;
use warnings;

# check the number of Perl command-line arguments in @ARGV
@ARGV == 3 or die "Usage: $0 first-line last-line input-file\n";
my ($first_line, $last_line, $filename) = @ARGV;

open(my $FILE, "<", $filename) or die "Could not read from $filename ($!)"; # open the input file
# loop through the input file
my $count = 1;
while (<$FILE>) {
  last if ($count > $last_line); # break the loop once past the last line
  print $_ if ($count >= $first_line); # print if the line number is at least the first param
  $count++; # increment the line counter
}
close $FILE; # close input file
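And here is a sketch for the second use case, assuming the wanted line numbers sit in a separate file, one number per line (the file layout is my assumption):

#!/usr/bin/perl
#
# Print the lines whose numbers are listed (one per line) in a list file.
# Usage: extract-listed-lines.pl line-numbers-file input-file
use strict;
use warnings;

@ARGV == 2 or die "Usage: $0 line-numbers-file input-file\n";
my ($list_file, $input_file) = @ARGV;

# read the wanted line numbers into a hash for O(1) lookup
open(my $LIST, "<", $list_file) or die "Could not read from $list_file ($!)";
my %wanted;
while (<$LIST>) {
  chomp;
  $wanted{$_} = 1 if /^\d+$/; # keep only numeric entries
}
close $LIST;

# print every line of the input whose line number ($.) is wanted
open(my $FILE, "<", $input_file) or die "Could not read from $input_file ($!)";
while (<$FILE>) {
  print if $wanted{$.};
}
close $FILE;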
MarcoS

Here is one way to do it with awk:

awk 'NR~"^(2|8|12)$"' file

This will print lines 2, 8 and 12.


This prints lines 2 to 7 and line 12:

awk 'NR>=2 && NR<8 || NR==12' file

or

awk 'NR~"^([2-7]|12)$"' file
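For example, against a hypothetical numbered input:

seq 15 | awk 'NR~"^([2-7]|12)$"'

prints 2 through 7 and 12, one number per line.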
Jotne
  • Thanks, but actually awk also fails at this: "/bin/awk: Argument list too long". I guess I have to use the perl option... – jul635 Oct 31 '14 at 09:07
  • @jul635 How do you add the arguments, can you show some example? If there is some type of pattern, it may be possible to simplify it. – Jotne Oct 31 '14 at 09:08
  • @Jotne: No specific pattern unfortunately. The rows to select are based on patterns in another file of the same size – jul635 Oct 31 '14 at 09:15
  • @jul635 So the line numbers come from an external file? One line per row? – Jotne Oct 31 '14 at 09:18

One sed command:

Use a ; separator instead of several -e expressions:

sed -n -e '2p;4p;12p' file

If that is too long for the bash command line:

Create a temporary file with the same structure inside (2p;4p;12p, or one command per line) and use the -f option:

sed -n -f TemporaryFile file
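For example, if the wanted line numbers are in a file lines.txt, one number per line (an assumption about your setup), the temporary file can be generated with another sed:

sed 's/$/p/' lines.txt > TemporaryFile
sed -n -f TemporaryFile file

The first command turns each number N into the sed command Np; with -f, one command per line works fine, so no ; separators are needed.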
NeronLeVelu
  • I am trying to do this exact thing, but having the exact problem that when the list of line numbers is too long (for me more than a few hundred) it errors with `sed: 2: TemporaryFile: command expected`. Assuming I want to do this with sed, any idea why this isn't working? I think it may have to do with https://stackoverflow.com/questions/19354870/bash-command-line-and-input-limit but I can't figure out how to correctly wrap it in a shell script to bypass the ARG_MAX. Also, the error doesn't specifically say ARG_MAX, so I'm not even sure that's it. Thank you! – seth127 Nov 22 '17 at 03:11
  • you can also spread the list is several TemporaryFile that are called sequentially `sed -n -f F1 file;sed -n -f F2 file; ...` – NeronLeVelu Nov 23 '17 at 13:46
  • That certainly works, but it’s way slower. Maybe sed just isn’t the right solution, but I’m trying to sample 200k lines from a 200M line file and it’s looking like it’ll take about 5 days. A bit too long. – seth127 Nov 25 '17 at 16:22
  • Just don't forget to change the last 'p' into a 'q' (print the current "line" and quit) in each file, and to optimize with a '1,X d' in the following files, where X is the last line of the previous file. – NeronLeVelu Nov 27 '17 at 10:04