Inverted index generation using Perl for a large data set

Question

My aim is to build an inverted index file in Perl. I have file(s) with 10 million+ lines in the form:

document id:  citing document 1; citing document 2;

Example:

document 56: document 12, document 45
document 117: document 12, document 22, document 99

and I want to create another file in the form:

document 12: document 117, document 56
...

Currently I am reading the source file(s) line by line and, for each citation, appending to the matching line of the index file (one line per document; see "In Perl, how do I change, delete, or insert a line in a file, or append to the beginning of a file?"). But updating the index file for each citation is very slow. Is there an alternative, more efficient approach? Thanks.

Please use the code formatting option when showing sample input (and code). — TLP, Nov 22 '13 at 14:14
I would say that if you have 10-million-line files, you either need lots of memory to store all the lines before printing, or you need to add to the lines as you go. Since seeks and prints would be expensive, to say the least, your only option for the latter solution would be `Tie::File`. — TLP, Nov 22 '13 at 14:17
http://www.perlmonks.org/?node_id=484831 if you don't insist on your own implementation. — mpapec, Nov 22 '13 at 14:17
Thank you very much for all the suggestions. The answers of vogomatix and user1534668 pointed me in the right direction. I will create one large hash, then parse the source files in smaller chunks (so that the hash fits in memory) to create one inverted index file per chunk. Merging those at the end is easy. — user3021974, Nov 22 '13 at 15:17
mpapec: Indeed, the Lucene approach would have been wiser. I had created my source index files by parsing 20+ GB of compressed XML files, which I suspect could have been easier with Lucene (I used XML::Simple). — user3021974, Nov 22 '13 at 15:19
TLP: Thanks, I will try Tie::File to see if it speeds things up. TLP, vogomatix: I will be more careful with code formatting next time; this was my first post :). Thanks again. — user3021974, Nov 22 '13 at 15:22

score 1 · Accepted Answer · answered Nov 22 '13 at 14:17

Instead of modifying the index file, adopt the following algorithm:

1. Load the inverted index file into a hash structure.
2. Read each document and add references into the hash structure.
3. Write the inverted index file.
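
The steps above could be sketched roughly as follows. This is only a minimal illustration, not the answerer's code: the helper names (`parse_line`, `load_index`, `add_citations`, `write_index`) are made up for this example, and it assumes the `id: a, b, c` line format from the question (also tolerating `;` as a separator) and that the whole index fits in memory.

```perl
use strict;
use warnings;

# Hypothetical helper: split "head: a, b, c" into ("head", "a", "b", "c").
# Assumes document ids contain no ':', ',' or ';' characters.
sub parse_line {
    my ($line) = @_;
    $line =~ s/^\s+|\s+\z//g;                 # trim newline and stray spaces
    return grep { length } split /\s*[:,;]\s*/, $line;
}

# Step 1: load an existing inverted index, if there is one, into a hash of
# hashes (cited document => set of citing documents).
sub load_index {
    my ($path) = @_;
    my %index;
    return \%index unless -e $path;           # first run: nothing to load
    open my $in, '<', $path or die "Cannot open $path: $!";
    while (my $line = <$in>) {
        my ($cited, @citing) = parse_line($line);
        next unless defined $cited;
        $index{$cited}{$_} = 1 for @citing;
    }
    close $in;
    return \%index;
}

# Step 2: read source lines and add each reference to the hash.
sub add_citations {
    my ($index, $fh) = @_;
    while (my $line = <$fh>) {
        my ($doc, @cites) = parse_line($line);
        next unless defined $doc;
        $index->{$_}{$doc} = 1 for @cites;    # inner hash avoids duplicates
    }
    return $index;
}

# Step 3: write the whole index once, instead of updating it per citation.
sub write_index {
    my ($index, $path) = @_;
    open my $out, '>', $path or die "Cannot write $path: $!";
    for my $cited (sort keys %$index) {
        print {$out} "$cited: ",
            join(', ', sort keys %{ $index->{$cited} }), "\n";
    }
    close $out or die "Cannot close $path: $!";
}

# Usage (file names are placeholders): perl build_index.pl SOURCE INDEX
if (@ARGV == 2) {
    my ($source, $index_file) = @ARGV;
    my $index = load_index($index_file);
    open my $src, '<', $source or die "Cannot open $source: $!";
    add_citations($index, $src);
    close $src;
    write_index($index, $index_file);
}
```

The inner hash is a design choice for this sketch: it keeps re-runs over the same source file from adding the same citing document twice, at the cost of more memory than a plain array.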

vogomatix


score 1 · Answer 2 · answered Nov 22 '13 at 14:19

You want to read in the file and build a hash with the data. This should get you started:

use strict;
use warnings;
use 5.010;

my %cited; # results go here

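while (<DATA>) { # in real use, read from your source file handle instead
    chomp;
    # first field is the citing document, the rest are the documents it cites
    my ($doc, @cites) = split /:\s+|,\s+/;
    for (@cites) {
        push @{$cited{$_}}, $doc; # invert: cited document => list of citers
    }
}
# print one line per cited document, in sorted order
for (sort keys %cited) {
    say "$_ cited in: ", join ", ", sort @{$cited{$_}};
}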

__DATA__
document 56: document 12, document 45
document 117: document 12, document 22, document 99
document 17: document 67, document 22, document 1
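
If a single hash does not fit in memory, the asker's plan from the comments (index the source in chunks, write one partial index per chunk, merge at the end) could look something like the sketch below. It is an illustration under assumptions rather than anyone's actual code: `next_record` and `merge_indexes` are hypothetical names, each partial file is assumed to use the `key: a, b` format from the question, and each is assumed to have been written with `sort keys`, so its keys are in ascending string order.

```perl
use strict;
use warnings;

# Read the next "key: a, b, c" record from a handle.
# Returns (key, [values]) or an empty list at end of file.
sub next_record {
    my ($fh) = @_;
    while (defined(my $line = <$fh>)) {
        $line =~ s/^\s+|\s+\z//g;
        next unless length $line;             # skip blank lines
        my ($key, $rest) = split /\s*:\s*/, $line, 2;
        return ($key, [ grep { length } split /\s*,\s*/, $rest // '' ]);
    }
    return;
}

# Merge partial index files whose keys are each in ascending string order.
# Only one record per input handle is held in memory at any time.
sub merge_indexes {
    my ($out, @in) = @_;                      # output handle, input handles
    my @head = map { [ next_record($_) ] } @in;
    while (1) {
        # smallest key among the current records of all inputs
        my ($min) = sort { $a cmp $b } map { $_->[0] } grep { @$_ } @head;
        last unless defined $min;             # every input is exhausted
        my %citing;
        for my $i (0 .. $#in) {
            # collect this key from every input that has it, then advance
            while (@{ $head[$i] } && $head[$i][0] eq $min) {
                $citing{$_} = 1 for @{ $head[$i][1] };
                $head[$i] = [ next_record($in[$i]) ];
            }
        }
        print {$out} "$min: ", join(', ', sort keys %citing), "\n";
    }
}
```

Because the merge relies on string comparison, the partial files need to be sorted the same way (plain `sort keys`, not a numeric sort on the document number); otherwise entries for the same document may end up on separate lines.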
