Currently I am an intern at a research group that makes large sets of texts (corpora) searchable. Users can search for literal strings, but more importantly they can also look for syntactic dependency structures similar to a given input, without needing to be proficient in any programming language or corpus annotation style. As may be clear, this tool is intended for linguists.
At the start of the project (before I joined) the tool was limited to rather small corpora (up to 9 million words). The goal is to make large sets of texts searchable as well: roughly 500 million words. An attempt has been made that in theory ought to improve speed by reducing the search space (see this paper), but this has not been tested yet. The result of this attempt is a new file structure. Let's call this structure B, as opposed to the non-processed structure A. We expect B to return results faster when queried with BaseX.
My question is: what is the best way to test and compare both approaches with a Perl script? Below you find my current script to query BaseX locally. It takes two arguments. The first is a directory containing text files, each of which stores XPaths, one per line; these are the XPaths that I have selected to benchmark with. The second is the maximum number of results to return; when set to zero, no limit is applied.
Because some parts of the dataset are so incredibly huge, we have divided them into separate, equally sized files as well, called treebankparts. They are listed in <tb> tags inside treebankparts.lst.
#!/usr/bin/perl
use strict;
use warnings;
# Perl client from the BaseX distribution; provides the Session class used below
use BaseXClient;
$| = 1;    # flush every print
# Directory where XPaths are stored
my $directory = shift(@ARGV);
# Set limit. If set to zero all results will be returned
my $limit = shift(@ARGV);
die "Usage: $0 <xpath-directory> <limit>\n"
    unless defined $directory && defined $limit;
# Create session, connect to BaseX
my $session = Session->new([INFORMATION WITHHELD]);
# List all files in directory
my @xpathfiles = glob("$directory/*.txt");
# Read lines of treebank parts into variable
open( my $tfh, '<', "treebankparts.lst" )
    or die "cannot open file treebankparts.lst: $!";
chomp( my @tlines = <$tfh> );
close $tfh;
# Loop through all XPaths in $directory
foreach my $xpathfile (@xpathfiles) {
    open( my $xfh, '<', $xpathfile ) or die "cannot open file $xpathfile: $!";
    chomp( my @xlines = <$xfh> );
    close $xfh;
    print STDOUT "File = $xpathfile\n";
    # Loop through lines from XPath file (= XPath query)
    foreach my $xline (@xlines) {
        # Loop through the lines of treebank file
        foreach my $tline (@tlines) {
            # Skip lines that do not contain a <tb>...</tb> entry
            my ($treebank) = $tline =~ /<tb>(.+)<\/tb>/ or next;
            QuerySonar( $xline, $treebank );
        }
    }
}
$session->close();
# Build one XQuery for $xpath against database $db, run it and print the output
sub QuerySonar {
    my ( $xpath, $db ) = @_;
    print STDOUT "Querying $db for $xpath\n";
    print STDOUT "Limit = $limit\n";
    my $x_limit;
    my $x_resultsofxp = 'declare variable $results := db:open("' . $db . '")/treebank/alpino_ds'
        . $xpath . ';';
    my $x_open       = '<results>';
    my $x_totalcount = '<total>{count($results)}</total>';
    my $x_loopinit   = '{for $node at $limitresults in $results';
    # Spaces are important!
    if ( $limit > 0 ) {
        $x_limit = ' where $limitresults <= ' . $limit . ' ';
    }
    # Comment needed to prevent `Incomplete FLWOR expression`
    else { $x_limit = '(: No limit set :)'; }
    my $x_sentenceinfo = 'let $sentid := ($node/ancestor::alpino_ds/@id)
        let $sentence := ($node/ancestor::alpino_ds/sentence)
        let $begin := ($node//@begin)
        let $idlist := ($node//@id)
        let $beginlist := (distinct-values($begin))';
    # Separate sentence info by tab
my $x_loopexit = 'return <match>{data($sentid)}	
{string-join($idlist, "-")}	
{string-join($beginlist, "-")}	
{data($sentence)}</match>}';
    my $x_close = '</results>';
    # Concatenate all XQuery parts
    my $x_concatquery =
          $x_resultsofxp
        . $x_open
        . $x_totalcount
        . $x_loopinit
        . $x_limit
        . $x_sentenceinfo
        . $x_loopexit
        . $x_close;
    # Send the query to BaseX and print whatever it returns
    my $querysent   = $session->query($x_concatquery);
    my $basexoutput = $querysent->execute();
    print $basexoutput . "\n\n";
    $querysent->close();
}
(Note that this is a stripped down version and that it may not work as-is. This snippet does not use structure B!)
What happens is: the script loops through all XPath files, through each line in an XPath file, and through all treebankparts, and calls the sub for each combination. The sub sends a new XQuery to BaseX and returns the total number of hits and the results (possibly limited by the argument passed to the Perl script). So I got that going, but now the question is: how can I improve this script so I can get some benchmarking results out of it?
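To make the timing part concrete, here is a minimal sketch of how a single call could be timed with the core module Time::HiRes. The helper name `time_call` is made up for illustration, and the closure is a stand-in that only builds a string so the snippet runs without a BaseX server; in the script above it would presumably wrap the `$querysent->execute()` call instead.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical helper: run a code reference once and return
# (elapsed wall-clock seconds, whatever the code returned).
sub time_call {
    my ($code) = @_;
    my $t0      = [gettimeofday];
    my $result  = $code->();
    my $elapsed = tv_interval($t0);    # seconds between $t0 and now
    return ( $elapsed, $result );
}

# Stand-in for a real query (assumption: no BaseX connection here).
my ( $seconds, $output ) = time_call( sub { return '<match/>' x 1000 } );

printf "%.6f s, %d characters returned\n", $seconds, length $output;
```

This measures wall-clock time as seen by the client, so it includes network and serialization overhead in addition to the time BaseX spends evaluating the query.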
First of all, I'd start by adding a profiler to this script. I guess that bit is obvious. However, I am not sure how I should go about comparing structure A with B. Would I put the queries for each database in a separate script, call a profiler on both, run both scripts a number of times, and compare the mean values? Or would I run each query against both databases in the same script, almost at the same time?
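As an illustration of the "run a number of times and compare" idea, the sketch below repeats two code references and summarises their timings. Everything in it is an assumption made for the example: the helper names `time_runs` and `summarise`, the number of runs, and above all the two closures, which are CPU-bound stand-ins for "query structure A" and "query structure B" so that the snippet runs without BaseX. Their timings say nothing about the real databases.

```perl
use strict;
use warnings;
use List::Util qw(sum);
use Time::HiRes qw(gettimeofday tv_interval);

# Run $code $runs times and return the individual wall-clock timings.
sub time_runs {
    my ( $code, $runs ) = @_;
    my @timings;
    for ( 1 .. $runs ) {
        my $t0 = [gettimeofday];
        $code->();
        push @timings, tv_interval($t0);
    }
    return @timings;
}

# Summarise a non-empty list of timings: count, mean, median, min, max.
sub summarise {
    my @t = sort { $a <=> $b } @_;
    my $n = @t;
    my $median =
        $n % 2
        ? $t[ ( $n - 1 ) / 2 ]
        : ( $t[ $n / 2 - 1 ] + $t[ $n / 2 ] ) / 2;
    return {
        n      => $n,
        mean   => sum(@t) / $n,
        median => $median,
        min    => $t[0],
        max    => $t[-1],
    };
}

# Stand-ins for the two structures (NOT real queries).
my %candidate = (
    A => sub { my $x = 0; $x += $_ for 1 .. 20_000; return $x },
    B => sub { my $x = 0; $x += $_ for 1 .. 10_000; return $x },
);

for my $name ( sort keys %candidate ) {
    my $s = summarise( time_runs( $candidate{$name}, 5 ) );
    printf "%s: n=%d mean=%.6f median=%.6f min=%.6f max=%.6f\n",
        $name, @{$s}{qw(n mean median min max)};
}
```

Reporting the median and the minimum next to the mean is a design choice: a single slow run (for example the first, uncached one) can pull the mean up considerably.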
It is important to consider the caching that takes place. Therefore I am not entirely sure which set-up is appropriate for benchmarking a database this huge: first one script, then the other; both at the same time; alternating queries between the two; and so on. There are so many possibilities, and I wonder which would provide the most reliable results. Also, I would repeat the process a couple of times. Would I repeat each query and then continue to the next, or finish all XPath files and then repeat the whole process?
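One of the orderings mentioned above, sketched under assumptions: every (query, structure) pair is run once per round, and the order within each round is shuffled so that neither structure always runs directly after the other. The helper name `build_schedule`, the two sample XPaths and the number of rounds are placeholders for illustration only. Whether this ordering really is the most reliable one for BaseX is exactly the open question.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Build a run schedule: each (query, structure) pair occurs once per
# round, and every round is shuffled independently.
sub build_schedule {
    my ( $queries, $structures, $rounds ) = @_;
    my @schedule;
    for my $round ( 1 .. $rounds ) {
        my @pairs;
        for my $q (@$queries) {
            for my $s (@$structures) {
                push @pairs, { round => $round, query => $q, structure => $s };
            }
        }
        push @schedule, shuffle(@pairs);
    }
    return @schedule;
}

# Placeholder XPaths and structure names (assumptions for this example).
my @schedule = build_schedule(
    [ '//node[@rel="su"]', '//node[@cat="np"]' ],
    [ 'A', 'B' ],
    3,
);

for my $job (@schedule) {
    # A real harness would time the query here; the timings of round 1
    # could be treated as warm-up and left out of the comparison.
    print join( "\t", $job->{round}, $job->{structure}, $job->{query} ), "\n";
}
```

Keeping the round number with each job makes it possible to compare the first (cold) round with the later (warm) rounds afterwards, instead of deciding up front which of the two to measure.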
(Reading the description of the benchmark-tag I am confident that this - albeit elaborate - post is suited for SO.)