
I'm trying to delete a specific line from a 12GB text file.

I do not have the sed -i option available on HP-UX, and other options like saving to a temporary file aren't working because I have only 20 GB of space available, with 12 GB already used by the text file.

Considering the space constraint, I'm trying to do this using Perl.

The following solution works for deleting the last 9 lines from the 12 GB file.

#!/usr/bin/env perl

use strict;
use warnings;

use Tie::File;

tie my @lines, 'Tie::File', 'test.txt' or die "$!\n";
$#lines -= 9;    # shrink the array by 9 elements, i.e. drop the last 9 lines
untie @lines;

I want to modify the above code to delete any specific line number.
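
The most direct modification, suggested in the comments, would be to splice the tied array. A minimal sketch of that idea (the file name and line number are placeholders) is shown below; the comments and the answer explain why this approach struggles with a file of this size:

#!/usr/bin/env perl

use strict;
use warnings;

use Tie::File;

# Sketch only: delete the 9th line (array index 8) by splicing the tied array.
tie my @lines, 'Tie::File', 'test.txt' or die "$!\n";
splice(@lines, 8, 1);
untie @lines;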

  • Download the file and manipulate it somewhere with better tools and more disk space? Install better tools, even if only in your home directory? – Schwern Apr 27 '18 at 18:04
  • @Schwern I've only terminal access with a few permissions. Thanks. – Vishwanath Dalvi Apr 27 '18 at 18:06
  • @mr_eclair: What about just `perl -ne 'print unless $. == 10'`, where `10` is the line number? You could use it for any line of choice. Or, for in-place: `perl -i -ne 'print unless $. == 10'` – Inian Apr 27 '18 at 18:10
  • @Inian That writes a new file and they don't have the space. – Schwern Apr 27 '18 at 18:11
  • @Schwern: Yes updated comment with in-place option – Inian Apr 27 '18 at 18:12
  • @Inian already tried this solution, got an error "disk full" in between. – Vishwanath Dalvi Apr 27 '18 at 18:12
  • `perl -i` isn't really in-place; it writes a temp file and replaces the original after the script is done. – chepner Apr 27 '18 at 18:12
  • @chepner: Yes, I was aware of that, but not sure how else an in-place edit could be done considering the disk constraints. – Inian Apr 27 '18 at 18:13
  • UNIX doesn't support this in general -- the filesystem primitives don't let you do in-place deletes without actually needing to rewrite everything past the point where the deletion takes place. Linux has some new kernel-level primitives (supported by only a very small number of filesystems) to do in-place inserts and deletes of blocks, but even then, your changes need to align to 4kb pages. – Charles Duffy Apr 27 '18 at 18:31
  • Do you really need to **delete** the line, and not just replace it with NULs? In-place replacement is easy and cheap; it's backfilling the space with content from later in the file that isn't (unless the edit is close to the end). – Charles Duffy Apr 27 '18 at 18:33
  • Re "*other options like saving the file to temporary file isn't working because I've 20 GB (12 GB already used by text file) space available*": do you have 20 GB available (which is plenty for using a temporary file), or did you mean to say you only have 8 GB available? – ikegami Apr 27 '18 at 19:19
  • How about a 16GB USB Memory Stick for $20? – Mark Setchell Apr 27 '18 at 19:38
  • Is there enough memory (RAM) to read the whole thing? Or, rather, how much is there? – zdim Apr 27 '18 at 19:43
  • Is this a one-time operation or will it be an ongoing process? – mwp Apr 27 '18 at 19:54
  • Is the line to be deleted always identified by just its line number within the file? Why did you say *"Didn't work"* to **toolic's** suggestion that you `splice` the tied array? What happened that was wrong? – Borodin Apr 27 '18 at 20:18
  • Deleting the *last* N lines off the end is trivial, because you don't need to backfill -- one could just use `truncate()` to perform it in constant time, after calculating the offset. It's not an example that can be extended to the general-case operation of removing content from anywhere in a file. – Charles Duffy Apr 27 '18 at 20:42
  • This information really should be in a database. A 12GB file will take 30mins just to read through it, and that's not a reasonable access time for most information. – Borodin Apr 28 '18 at 00:59
  • Can you use `awk`? You can probably [do this](https://stackoverflow.com/questions/2112469/delete-specific-line-numbers-from-a-text-file-using-sed) then. – jjmerelo Apr 28 '18 at 09:36
  • What about zipping the file and then `zcat $file | sed ... | gzip > $new_file.gz` followed by `mv new_file.gz file.gz; gunzip new_file.gz`? This could work if the zipped file is smaller than 8GB. – PerlDuck Apr 28 '18 at 11:31
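
Following up on Charles Duffy's suggestion above to replace the line rather than delete it: a minimal sketch of that in-place blanking (the file name and line number are placeholders) could look like the following. It pads the target line with spaces (NUL bytes would work the same way), leaving the newline and every byte after it untouched, so no data has to be shifted and no extra disk space is needed:

#!/usr/bin/env perl

use strict;
use warnings;

use Fcntl qw( SEEK_CUR );

my $qfn    = 'test.txt';   # placeholder file name
my $target = 9;            # placeholder line number

open(my $fh, '+<:raw', $qfn)
   or die("Can't open \"$qfn\": $!\n");

while (my $line = <$fh>) {
   next if $. != $target;

   # Seek back to the start of the line and overwrite everything
   # except the trailing newline with spaces.
   my $len = length($line);
   $line =~ s/\n\z//;
   seek($fh, -$len, SEEK_CUR)
      or die("Can't seek: $!\n");
   print {$fh} ' ' x length($line);
   last;
}

close($fh)
   or die("Can't close \"$qfn\": $!\n");

Whether a blanked-out line is acceptable depends on whatever later consumes the file.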

1 Answer


Tie::File is never the answer.

  • It's insanely slow.
  • It can use up more memory than just slurping the entire file into memory, even if you limit the size of its buffer.

You are hitting both of those problems here. The task requires visiting every line of the file (either to find the line to remove, or to shift everything that follows it), so Tie::File will read the entire file and store the index of every line in memory. That index costs 28 bytes per line on a 64-bit build of Perl (not counting any overhead in the memory allocator); as a rough illustration, if the 12 GB file averaged 80 bytes per line, that would be about 150 million lines and over 4 GB just for the index.


To delete the last 9 lines of the file, you can use the following:

use File::ReadBackwards qw( );

my $qfn = '...';

my $pos;
{
   my $bw = File::ReadBackwards->new($qfn)
      or die("Can't open \"$qfn\": $!\n");

   # Step backwards over the last 9 lines (stop early if the file is shorter).
   for (1..9) {
      defined( my $line = $bw->readline() )
         or last;
   }

   # Byte offset at which the last 9 lines begin.
   $pos = $bw->tell();
}

# Can't use $bw->get_handle because it's a read-only handle.
truncate($qfn, $pos)
   or die("Can't truncate \"$qfn\": $!\n");

To delete an arbitrary line, you can use the following:

my $qfn = '...';

open(my $fh_src, '<:raw', $qfn)
   or die("Can't open \"$qfn\": $!\n");    
open(my $fh_dst, '+<:raw', $qfn)
   or die("Can't open \"$qfn\": $!\n");
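
# Both handles are open on the same file: $fh_src reads ahead while
# $fh_dst rewrites the file in place. Since one line is skipped, the
# write position can never overtake the read position.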

while (<$fh_src>) {
   next if $. == 9;  # Or "if /keyword/", or whatever condition you want.

   print($fh_dst $_)
      or die($!);
}

truncate($fh_dst, tell($fh_dst))
   or die($!);    

The following optimized version assumes there's only one line (or block of lines) to remove:

use Fcntl qw( SEEK_CUR SEEK_SET );

use constant BLOCK_SIZE => 4*1024*1024;

my $qfn = 'file';

open(my $fh_src, '<:raw', $qfn)
   or die("Can't open \"$qfn\": $!\n");
open(my $fh_dst, '+<:raw', $qfn)
   or die("Can't open \"$qfn\": $!\n");
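
# Scan forward to find the byte offset at which the line to be removed
# starts. $dst_pos is set back to undef if the end of file is reached first.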

my $dst_pos;
while (1) {
   $dst_pos = tell($fh_src);
   defined( my $line = <$fh_src> )
      or do {
         $dst_pos = undef;
         last;
      };

   last if $. == 9;  # Or "if /keyword/", or whatever condition you want.
}

if (defined($dst_pos)) {
   # We're switching from buffered I/O to unbuffered I/O,
   # so we need to move the system file pointer from where the
   # buffered read left off to where we actually finished reading.
   sysseek($fh_src, tell($fh_src), SEEK_SET)
      or die($!);

   sysseek($fh_dst, $dst_pos, SEEK_SET)
      or die($!);
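
   # Copy everything after the removed line toward the start of the file
   # in BLOCK_SIZE chunks, using unbuffered I/O.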

   while (1) {
      my $rv = sysread($fh_src, my $buf, BLOCK_SIZE);
      die($!) if !defined($rv);
      last if !$rv;

      my $written = 0;
      while ($written < length($buf)) {
         my $rv = syswrite($fh_dst, $buf, length($buf)-$written, $written);
         die($!) if !defined($rv);
         $written += $rv;
      }
   }

   # Must use sysseek instead of tell with sysread/syswrite.    
   truncate($fh_dst, sysseek($fh_dst, 0, SEEK_CUR))
      or die($!);
}
ikegami
  • "*You need the number of lines in the file...*" damn, you're right. Even if you avoid it in the `for` loop, `splice` calls `FETCHSIZE`. – Schwern Apr 27 '18 at 22:57
  • @Schwern, Simply visiting a line is enough to cache its position, and the OP invariably needs to visit every line (either as part of finding the line to remove, or as part of shifting every line after the "deleted" line). So while avoiding `FETCHSIZE` would save you from reading the file twice, it won't save you any memory. – ikegami Apr 28 '18 at 00:08
  • Interesting approach! I am curious why would the last version be faster than the second? Is it because it uses `read` with a buffer size of 4MB instead of `readline` which I assume uses a buffersize of 8KB? – Håkon Hægland Apr 28 '18 at 04:42
  • @HåkonHægland, `read` and `readline` are both buffered io, so both should use the same buffer. (8 KiB in newer Perls.) /// I meant to use `sysread`, though you should still get savings from using `read`. Fewer scalars built, and more time in C/less time in Perl. Will switch to `sysread` when I can test. – ikegami Apr 28 '18 at 15:10