Building indexes for files in Perl

Question

I'm new to Perl, and I've run into a problem:

My task is to find the fastest possible way to access a given line of a big file in Perl. I created a file of 5 million lines in which each line contains its own line number. I then wrote a main program that needs to be able to print the content of any given line. To do this, I'm using two routines I found on the internet:

use Config qw( %Config );

my $off_t = $Config{lseeksize} > $Config{ivsize} ? 'F' : 'j';
my $file = "testfile.err";
open(FILE, "< $file")         or die "Can't open $file for reading: $!\n";
open(INDEX, "+>$file.idx")
        or die "Can't open $file.idx for read/write: $!\n";
build_index(*FILE, *INDEX);
my $line = line_with_index(*FILE, *INDEX, 129);
print "$line";

sub build_index {
    my $data_file  = shift;
    my $index_file = shift;
    my $offset     = 0;

    while (<$data_file>) {
        print $index_file pack($off_t, $offset);
        $offset = tell($data_file);
    }
}

sub line_with_index {
    my $data_file   = shift;
    my $index_file  = shift;
    my $line_number = shift;

    my $size;               # size of an index entry
    my $i_offset;           # offset into the index of the entry
    my $entry;              # index entry
    my $d_offset;           # offset into the data file

    $size = length(pack($off_t, 0));
    $i_offset = $size * ($line_number-1);
    seek($index_file, $i_offset, 0) or return;
    read($index_file, $entry, $size);
    $d_offset = unpack($off_t, $entry);
    seek($data_file, $d_offset, 0);
    return scalar(<$data_file>);
}

These routines only work some of the time. Roughly one try in ten returns a value, across different sets of inputs, but most of the time I get "Use of uninitialized value $line in string at test2.pl line 10" (for example when looking for line 566 in the file) or the wrong numeric value. Moreover, the indexing seems to work fine for the first two hundred or so lines, but after that I get the error. I really don't know what I'm doing wrong.

I know you can use a basic loop that will parse each line, but I really need a way of accessing, at any given time, one line of a file without reparsing it all over again.

Edit: I've tried a tip found in "Reading a particular line by line number in a very large file". I've replaced the "N" template for pack with:

my $off_t = $Config{lseeksize} > $Config{ivsize} ? 'F' : 'j';

This makes the process work better, up to line 128, where instead of getting 128 I get a blank string. For 129, I get 3, which doesn't mean much.

Edit 2: Basically, what I need is a mechanism that lets me read, for instance, the next 2 lines of a file that is already being read, while keeping the read "head" at the current line (and not 2 lines further on).

Thanks for your help!

Re "Used of uninitialized value $line in string at test2.pl line 46" The program only has 7 lines! What output do you actually get? — ikegami, Apr 16 '14 at 15:02
I had no issues running your script with a file that has 2GB of data. Using pack "N" with a file over 4GB would be an issue. Use pack "J" (uppercase) to get around that. — imran, Apr 16 '14 at 15:03
@imran, That's wrong. The `$Config{lseeksize} > $Config{ivsize} ? 'F' : 'j'` that he's already using is much better. 1) 'j' is more appropriate since seek takes a signed number. 2) What you suggest won't help at all on 32-bit machines (though I'm not sure that 'F' will either). — ikegami, Apr 16 '14 at 15:06
@ikegami, Are you suggesting that in the statement `$d_offset = unpack("J", $entry);`, `$d_offset`, which is then used in the seek call, will be unsigned? I assumed user is on a 64-bit machine. — imran, Apr 16 '14 at 15:19
@imran, Yes. "J: A Perl internal unsigned integer value (UV)" — ikegami, Apr 16 '14 at 15:20
@ikegami, Yes, "j" would be better to stay consistent with seek and tell, etc. , but `$d_offset` will be a signed integer value (IV) after the line `$d_offset = unpack("J", $entry);`, according to Devel::Peek. — imran, Apr 16 '14 at 15:52
@imran Well, I don't know where it comes from then. The file I'm testing on is about 56 MB and has, on each line, the number of that line. Are you on Windows? I'm updating my code so that you have the right lines! — Jonathan Taws, Apr 16 '14 at 15:55
@ikegami I've updated my code with the correct line, it comes from the print $line (line 10) — Jonathan Taws, Apr 16 '14 at 16:06
Most likely error: `seek` failed. Use `seek($data_file, $d_offset, 0) or die $!;` If it dies, check `$d_offset` against what you expect it to be. — ikegami, Apr 16 '14 at 17:50
@imran, Perl often converts UVs to IVs if the number fits in an IV. Try with a number that doesn't fit in an IV to see the difference. — ikegami, Apr 16 '14 at 17:51
@imran Do you think it's possible to use an array instead of a file to store the indexes ? I won't be using the index file in another program, so I think an array is more interesting. — Jonathan Taws, Apr 17 '14 at 08:56
@Hawknight It is possible to use an array or an in-memory file. It all depends on your use case. The advantage of using an index file on disk is that you would not have to reread the main file every time your program starts up (unless the file changes often). You can just read in the index file into memory. The array or some other data structure in memory would be faster but can bloat the memory footprint of your script. — imran, Apr 17 '14 at 12:42
@imran My thinking is that since I build the index and use it only once, within the same program, an index file might not be the best solution. The incentive for using a file is, as you said, to keep a low memory footprint. I suppose that if I index a 4 GB file, an array could blow up the memory footprint of my program? — Jonathan Taws, Apr 18 '14 at 06:49
@Hawknight it depends on how many lines are in the file you are indexing. 60 million lines can cause your script to use about 2GB of memory. — imran, Apr 18 '14 at 16:37
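The in-memory alternative discussed in the comments above could look roughly like the sketch below. It is only an illustration: the temporary test file and the helper names line_at_offset, line_from_array and line_from_packed are hypothetical, and the 'j' template assumes a perl whose integers are wide enough for the file's offsets. It keeps the offsets both in a plain array and in one packed string. The packed string costs exactly one fixed-size entry per line, whereas an array of scalars carries extra per-element overhead.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Hypothetical test data: 1000 lines, each holding its own line number.
my ($tmp, $file) = tempfile(UNLINK => 1);
print {$tmp} "$_\n" for 1 .. 1000;
close($tmp) or die "Can't close $file: $!\n";

open(my $data, '<', $file) or die "Can't open $file for reading: $!\n";

my $template = 'j';                         # assumption: IVs can hold every offset
my $size     = length pack($template, 0);   # bytes per packed entry

# Build both in-memory indexes in a single pass over the data file.
my @offsets;            # one Perl scalar per line
my $packed = '';        # one fixed-size packed entry per line
my $offset = 0;
while (<$data>) {
    push @offsets, $offset;
    $packed .= pack($template, $offset);
    $offset = tell($data);
}

# Seek to a byte offset and return the line that starts there.
sub line_at_offset {
    my ($fh, $pos) = @_;
    seek($fh, $pos, 0) or die "seek failed: $!\n";
    return scalar <$fh>;
}

# Look up line $n (1-based) through the array index.
sub line_from_array {
    my ($fh, $offsets, $n) = @_;
    return if $n < 1 || $n > @$offsets;
    return line_at_offset($fh, $offsets->[$n - 1]);
}

# Look up line $n (1-based) through the packed-string index.
sub line_from_packed {
    my ($fh, $packed_ref, $n) = @_;
    my $start = $size * ($n - 1);
    return if $n < 1 || $start + $size > length($$packed_ref);
    my $pos = unpack($template, substr($$packed_ref, $start, $size));
    return line_at_offset($fh, $pos);
}

print line_from_array($data, \@offsets, 566);
print line_from_packed($data, \$packed, 566);
```

Because nothing is written to a file here, the text-mode newline translation discussed in this thread cannot corrupt the index. The trade-off is that the index has to be rebuilt on every run.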

score 2 · Accepted Answer · answered Apr 16 '14 at 18:04

Since you are writing binary data to the index file, you need to set the filehandle to binary mode, especially if you are on Windows:

open(INDEX, "+>$file.idx")
    or die "Can't open $file.idx for read/write: $!\n";
binmode(INDEX);

Right now, when you do something like this on Windows:

print $index_file pack("j", $offset);

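Perl will convert every 0x0a byte in the packed string to 0x0d 0x0a on output. Any offset whose packed form contains a 0x0a byte therefore takes up one more byte than expected in the index file, which shifts all later entries and breaks the fixed-size arithmetic in line_with_index. Setting the filehandle to binmode makes sure line feeds are not converted to carriage return-line feed pairs.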
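Putting the fix together with the routines from the question, a self-contained version might look like the sketch below. It is not the poster's final program. The temporary directory and the generated 1000-line test file are stand-ins for the real data, and the lexical filehandles, three-argument open and short-read check are additions that the answer itself does not require.

```perl
use strict;
use warnings;
use Config qw( %Config );
use File::Temp qw( tempdir );

# Same template choice as in the question.
my $off_t = $Config{lseeksize} > $Config{ivsize} ? 'F' : 'j';
my $size  = length pack($off_t, 0);        # size of one index entry

# Hypothetical test data: 1000 lines, each holding its own line number.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/testfile.err";
open(my $out, '>', $file) or die "Can't open $file for writing: $!\n";
print {$out} "$_\n" for 1 .. 1000;
close($out) or die "Can't close $file: $!\n";

open(my $data, '<', $file)
    or die "Can't open $file for reading: $!\n";
open(my $index, '+>', "$file.idx")
    or die "Can't open $file.idx for read/write: $!\n";
binmode($index);    # the fix: no newline translation on packed offsets

# Write the starting offset of every line as a fixed-size entry.
sub build_index {
    my ($data_fh, $index_fh) = @_;
    my $offset = 0;
    while (<$data_fh>) {
        print {$index_fh} pack($off_t, $offset);
        $offset = tell($data_fh);
    }
}

# Return line $line_number (1-based), or nothing if it is not indexed.
sub line_with_index {
    my ($data_fh, $index_fh, $line_number) = @_;
    seek($index_fh, $size * ($line_number - 1), 0) or return;
    my $got = read($index_fh, my $entry, $size);
    return unless defined $got && $got == $size;   # past the end of the index
    seek($data_fh, unpack($off_t, $entry), 0)
        or die "Can't seek in data file: $!\n";
    return scalar <$data_fh>;
}

build_index($data, $index);
print line_with_index($data, $index, $_) for 6, 128, 129, 566;
```

The short-read check makes a request for a line past the end of the file return nothing instead of unpacking a partial entry, which is one way an undefined $line can arise.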

imran

I'll check this tomorrow on the right computer, seems like a good shot ! Will keep you updated. – Jonathan Taws Apr 16 '14 at 18:19
Works like a charm, thanks so much! I still have a small problem with the uninitialized-value warning for one specific file, but it doesn't affect the result. – Jonathan Taws Apr 17 '14 at 08:01
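For the requirement in Edit 2 of the question (reading the next few lines while leaving the read position where it was), one possible approach is to remember the position with tell, read ahead, and seek back. The sketch below assumes a seekable filehandle. The helper name peek_lines is hypothetical, and an in-memory file stands in for the real data.

```perl
use strict;
use warnings;

# Return up to $count upcoming lines without moving the read position.
sub peek_lines {
    my ($fh, $count) = @_;
    my $pos = tell($fh);
    die "tell failed: $!\n" if $pos < 0;
    my @lines;
    while (@lines < $count) {
        my $line = <$fh>;
        last unless defined $line;      # stop quietly at end of file
        push @lines, $line;
    }
    seek($fh, $pos, 0) or die "Can't restore position: $!\n";
    return @lines;
}

# Hypothetical data: an in-memory file holding the lines 1 to 10.
my $text = join '', map { "$_\n" } 1 .. 10;
open(my $fh, '<', \$text) or die "Can't open in-memory file: $!\n";

my $current = <$fh>;                # consume the first line
my @next    = peek_lines($fh, 2);   # look at the two lines after it
my $after   = <$fh>;                # normal reading resumes after $current

print "current: $current";
print "peeked:  $_" for @next;
print "resumed: $after";
```

Near the end of the file peek_lines returns fewer lines than requested, so callers should check how many they got before using them.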
