
What is the most efficient way to read the beginning and the end of a huge file (binary or text), limited to a given number of bytes?

Example:


=head2 read_file_contents(file, limit)

Given a filename, returns part of its content, limited to the given number of bytes, along with the number of truncated bytes

=cut
sub read_file_contents
{
    my ($file, $limit) = @_;
    my $rv;

    # Starting and ending number of bytes to read
    $limit = $limit / 2;

    # Reading beginning of file
    my $start;

    # code goes here

    # Reading end of a file
    my $end;

    # code goes here

    $rv = $start . "\n\n\n truncated N bytes of data \n\n\n" . $end;

    return $rv;
}

The main goal is to fetch the file's first and last bytes quickly, without processing the whole file. Reading the whole file and then using `substr` to cut out what is needed is easy, but it will not work well with files of 10 GB or more.

Any solutions would be appreciated.

Ilia Ross
  • *Note:* Neither Stack Overflow nor other communities had clear answers to similar questions, so this is more about creating a good answer than just finding a solution for myself. – Ilia Ross Dec 14 '20 at 14:15

3 Answers

3
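open(my $fh, "<", $file) or die "...";
# Read the first $limit bytes
my $r = read($fh, $start, $limit) or die "...";
die "short read\n" unless $r == $limit;
# Seek to $limit bytes before the end (whence 2 is SEEK_END), then read the tail
seek($fh, -$limit, 2) or die "...";
$r = read($fh, $end, $limit) or die "...";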
Dave Mitchell
  • 5
    This would be a great answer if it included a brief explanation and maybe a perldoc reference. – TLP Dec 14 '20 at 13:51
  • 3
    `use Fcntl qw(SEEK_END)` then `seek($fh, -$limit, SEEK_END)` would be better than hardcoding `2` in there – Dada Dec 14 '20 at 14:02
  • Thanks, @DaveMitchell. Neither Stack Overflow nor other communities had clear answers to similar questions, so this is more about creating a good answer than just finding a solution for myself. What about counting the number of truncated bytes? Is there a better way than subtracting _total-limit_? Could you please expand this into a better formatted and explained answer? – Ilia Ross Dec 14 '20 at 14:10
  • 1
    Downvote because of magic numbers. Would be happy to change to an upvote when this is fixed. – ikegami Dec 14 '20 at 17:57
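
Pulling the comments above together, here is a minimal sketch that names the `whence` constant through `Fcntl`, opens the file with the `:raw` layer, and answers the truncated-bytes question by plain subtraction from the size reported by `-s`. The sub name `head_and_tail` and its return shape are assumptions made for illustration; they are not part of the answer above.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_END);

# Hypothetical helper: returns (first $n bytes, last $n bytes, bytes skipped in between).
sub head_and_tail {
    my ($file, $n) = @_;

    open(my $fh, '<:raw', $file) or die "Can't open $file: $!";
    my $size = -s $fh;
    die "$file is smaller than 2 * $n bytes\n" if $size < 2 * $n;

    # read() returns undef on error, so test definedness rather than truth
    my ($start, $end);
    defined(read($fh, $start, $n)) or die "Can't read $file: $!";

    # A negative offset with SEEK_END positions relative to the end of the file
    seek($fh, -$n, SEEK_END) or die "Can't seek in $file: $!";
    defined(read($fh, $end, $n)) or die "Can't read $file: $!";
    close($fh);

    return ($start, $end, $size - 2 * $n);
}
```

Only the two requested chunks are read, so the cost should not depend on the file size. The size check up front avoids a failed `seek` when the file is shorter than the requested chunks.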
2

Thanks @DaveMitchell for the insight. Thanks to @ikegami for useful tips. This is what I eventually came up with.

It can be useful for tailing logs (returning reversed output) or previewing files of any size efficiently.

Example:

use Fcntl qw(SEEK_END);

=head2 read_file_contents_limit(file, limit, [opts])

Given a filename, returns its partial content with limit in bytes,
by default collected from both beginning and end of the file 
* Options is a hash reference with
  - [head]      : Head the file only and just return beginning bytes
  - [tail]      : Tail the file only and return ending bytes
  - [reverse]   : Reverse output
  - [nomessage] : Remove truncated message

=cut
sub read_file_contents_limit
{
    my ($file, $limit, $opts) = @_;
    my $data;
    my $reverse = sub {
        return join("\n", reverse split("\n", $_[0]));
    };
    my $nonulls = sub {
        $_[0] =~ s/[^[:print:]\n\r\t]/\ /g;
        return $_[0];
    };

    # Is binary file
    my $binary = -B $file;

    # Open file
    open(my $fh, "<", $file) || return undef;
    binmode $fh if ($binary);

    # Get file size
    my $fsize = -s $file;

    # Return full file if requested limit fits the size
    if ($fsize <= $limit) {
        my $full;
        read($fh, $full, $fsize);
        $full = &$nonulls($full)
          if ($binary);
        $full = &$reverse($full)
          if ($opts->{'reverse'});
        return $full;
    }

    # Starting and ending number of bytes to read
    my $split = !$opts->{'head'} && !$opts->{'tail'};
    # int() keeps the byte count whole when the limit is odd
    $limit = int($limit / 2) if ($split);

    # Create truncated message
    my $truncated = $fsize - $limit;
    $truncated -= $limit if ($split);
    $truncated = "\n\n\n[--- truncated ${truncated} bytes of data ---]\n\n\n";
    # Use an empty string so later concatenation does not warn about undef
    $truncated = '' if ($opts->{'nomessage'});

    # Reading beginning of file
    my $head;
    read($fh, $head, $limit);

    # Return beginning only if requested
    if ($opts->{'head'}) {
        $head = &$nonulls($head)
          if ($binary);
        $head = &$reverse($head)
          if ($opts->{'reverse'});
        return $head . $truncated;
    }

    # Reading end of file
    my $tail;
    seek($fh, -$limit, SEEK_END);
    read($fh, $tail, $limit);

    # Return ending only if requested
    if ($opts->{'tail'}) {
        $tail = &$nonulls($tail)
          if ($binary);
        $tail = &$reverse($tail)
          if ($opts->{'reverse'});
        return $truncated . $tail;
    }

    # Return combined data
    $data = $head . $truncated . $tail;

    # Remove nulls for binary
    $data = &$nonulls($data)
      if ($binary);

    # Reverse output if needed
    $data = &$reverse($data)
      if ($opts->{'reverse'});
    return $data;
}

Here is an example of how it can be used to tail a log file and show the latest log lines at the top.

Usage:

say read_file_contents_limit('/var/webmin/miniserv.log', 2000, {'tail', 1, 'reverse', 1});

Output:

[--- truncated 1092091 bytes of data ---]


10.211.55.2 - root [14/Dec/2020:16:47:37 +0000] "GET /favicon.ico HTTP/1.1" 200 15086
10.211.55.2 - root [14/Dec/2020:16:47:37 +0000] "GET /debug.cgi HTTP/1.1" 200 3662
10.211.55.2 - root [14/Dec/2020:16:47:36 +0000] "GET /favicon.ico HTTP/1.1" 200 15086
10.211.55.2 - root [14/Dec/2020:16:47:36 +0000] "GET /debug.cgi HTTP/1.1" 200 3662
10.211.55.2 - root [14/Dec/2020:16:47:35 +0000] "GET /favicon.ico HTTP/1.1" 200 15086
10.211.55.2 - root [14/Dec/2020:16:47:35 +0000] "GET /debug.cgi HTTP/1.1" 200 3662
10.211.55.2 - root [14/Dec/2020:16:47:34 +0000] "GET /favicon.ico HTTP/1.1" 200 15086
10.211.55.2 - root [14/Dec/2020:16:47:34 +0000] "GET /debug.cgi HTTP/1.1" 200 3662
10.211.55.2 - root [14/Dec/2020:16:47:30 +0000] "GET /favicon.ico HTTP/1.1" 200 15086
10.211.55.2 - root [14/Dec/2020:16:47:30 +0000] "GET /debug.cgi HTTP/1.1" 200 3662
10.211.55.2 - root [14/Dec/2020:16:47:24 +0000] "GET /favicon.ico HTTP/1.1" 200 15086
brian d foy
Ilia Ross
  • Tip: You should use `open(my $fh, "<:raw", $file)` for binary files – ikegami Dec 14 '20 at 17:21
  • Tip: Use `use Fcntl qw(SEEK_END);` and `seek($fh, -$limit, SEEK_END)` instead of using magical numbers. – ikegami Dec 14 '20 at 17:21
  • Thanks, @ikegami! However, isn't `:raw` only needed on Windows, doing nothing on Linux? Besides, why not use a magic number? Is it expected to change? – Ilia Ross Dec 14 '20 at 17:31
  • 1
    Re raw: It can matter on unix too, and it does in every one of my programs. Might as well get in the habit of doing it right. At the very least, it signals your intentions to the reader. /// Re magic numbers, Readable code. The most important thing when programming. Your code will be changed, maintained, debugged, shared, etc. And one needs to be able to read it to do those. In general, they're often subject to change, and might be different on different systems. (I don't think it's the case for these specifically, but ...?) – ikegami Dec 14 '20 at 17:56
  • @ikegami - I am about to edit my answer. Quick question though: should `use Fcntl qw(SEEK_END);` go outside the sub, at the top of the file? Would it be okay to put it inside the sub, or would that make no sense because `use` is processed at compile time anyway? Or would it be scoped to the sub's execution if placed inside? If it runs at compile time and cannot be used inside the sub, would it be alright to use `eval "use Fcntl qw(SEEK_END);"` inside the subroutine instead? What do theory and best practice say about this? – Ilia Ross Dec 14 '20 at 18:12
  • Re "*it should be used outside of the sub, at the top of the file?*", The imported constant is global, so there's no point in putting it in the function. If you put it inside the function, and you had two functions, you would get a (harmless) "redefined" warning. – ikegami Dec 14 '20 at 18:13
  • @ikegami - thanks! I edited my answer, also added `binmode` as you mentioned before. Is this alright now, I assume? – Ilia Ross Dec 14 '20 at 18:20
  • While `seek`, the crux of the solution, is simple here I'd still like to bring up [File::ReadBackwards](https://metacpan.org/pod/File::ReadBackwards) if you aren't aware of it. (Here is one [example](https://stackoverflow.com/a/62566399/4653379) of its use, that I can easily find; there's much more. This one also has a speed measurement on a large file.) – zdim Dec 14 '20 at 22:18
  • (never mind, `File::ReadBackwards` can't easily help with fetching bytes... I'll leave the comment for a little while longer because it is a useful module to know about but will then remove it) – zdim Dec 14 '20 at 22:38
  • @zdim Thanks, I was aware of it though. Leave it for the future readers. – Ilia Ross Dec 14 '20 at 23:07
  • @IliaRostovtsev Alright then, and I'll leave the other note (saying that that module isn't so useful here) – zdim Dec 14 '20 at 23:17
  • If you read raw bytes from a unicode file, the read could start or end inside of a unicode sequence resulting in corrupted output. – lordadmira Dec 15 '20 at 05:58
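
To illustrate the last comment: when the file is UTF-8 text, a byte-based chunk can begin or end in the middle of a multi-byte character. The following is a minimal sketch of one way to trim such partial sequences before decoding. The helper name `trim_partial_utf8` is hypothetical, and the sketch assumes the file itself is valid UTF-8; it is not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: drop partial UTF-8 sequences from both ends of a raw byte chunk.
# Assumes the underlying file is valid UTF-8.
sub trim_partial_utf8 {
    my ($bytes) = @_;

    # Leading continuation bytes (0x80-0xBF) belong to a character that began before the chunk
    $bytes =~ s/\A[\x80-\xBF]{1,3}//;

    # A trailing lead byte with too few continuation bytes is a character cut off by the chunk end
    $bytes =~ s/(?:[\xC2-\xDF]|[\xE0-\xEF][\x80-\xBF]?|[\xF0-\xF4][\x80-\xBF]{0,2})\z//;

    return $bytes;
}
```

It could be applied to the head and tail chunks separately, before they are joined or passed to `decode` from the core `Encode` module. The reported truncated byte count would then be off by the few bytes trimmed.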
0

Check file size, then seek near the end...

https://perldoc.perl.org/functions/seek
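
As a minimal sketch of that idea: take the size with `-s`, clamp the requested length so the offset never goes negative, and `seek` to an absolute position with `SEEK_SET`. The sub name `tail_bytes` is hypothetical and chosen only for this example.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Hypothetical helper: returns at most the last $n bytes of $file.
sub tail_bytes {
    my ($file, $n) = @_;

    open(my $fh, '<:raw', $file) or die "Can't open $file: $!";
    my $size = -s $fh;

    # Never ask for more than the file holds, so the offset stays >= 0
    $n = $size if $n > $size;

    seek($fh, $size - $n, SEEK_SET) or die "Can't seek in $file: $!";
    defined(read($fh, my $tail, $n)) or die "Can't read $file: $!";
    close($fh);

    return $tail;
}
```

Clamping to the file size means a short file is simply returned whole instead of making `seek` fail.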