
I have this large xml file (>10 GB) coming from something called blast (genomics). My goal is to produce a smaller xml file by extracting parts of it based on a list of hits (hits.txt).

hits.txt
object1
object2
object130
object958

I want to extract all those hits and therefore build a smaller xml file.

The method I used extracts, for every object in hits.txt, the block that starts with <Iteration_query-def>objectxxx and ends with <Iteration> (i.e. the block dedicated to objectxxx in the large xml file).

Here is my code. It works well ($1 is the large xml file, $2 is hits.txt, and $3 is my output xml file):

while read NAME
do
    sed -n -e '/<Iteration_query-def>'"$NAME"'/,/<Iteration>/p' "$1" >> "$3"
    echo -e "$NAME information has been extracted"
done < <(grep . "$2")

NOW, it is extremely slow! It'll take 11 days to process only 5000 objects (around 3 minutes per object in hits.txt file). Would you have a better method???

Sara
    That's because you're processing the _entire_ input file multiple times, once per section you want to extract. You should be able to do this in one pass using a more powerful scripting language (Perl, Python, AWK, etc). sed is probably the wrong tool for this job. – Jim Garrison Mar 27 '14 at 19:27
  • Is there any particular reason you have tagged this "xml"? – Michael Kay Mar 27 '14 at 20:44
    In general, sed is the wrong tool for parsing XML -- XMLStarlet (http://xmlstar.sourceforge.net/), `xmllint --xpath`, or a similar special-purpose tool will do a much, much better job. – Charles Duffy Mar 27 '14 at 20:59
  • Thanks for your answer. I'm a newbie and I was just showing how my scripting knowledge could resolve the problem, knowing someone would propose some xml-specific tool that is much more efficient. So that explains the xml tag, I guess. Thanks for the tips, I'll look into xmlstarlet. – Sara Mar 29 '14 at 14:43
  • @JimGarrison I already wrote a python parser using ElementTree and pandas which would take out all the wanted data and put it in a table. It works fine. That's not my goal here, I just want to extract blocks of text and build a shorter xml file. Python wouldn't be of better use than shell scripting in this particular case, right? – Sara Mar 29 '14 at 15:02
  • Can you post a sample structure of the source xml? Specifically the relationship of `<Iteration_query-def>` and `<Iteration>` elements. Are they siblings or is one a child of the other? – Cole Tierney Apr 02 '14 at 21:38

4 Answers


Processing XML using non-XML-aware tools is always going to lead to grief. Especially when the file is so large that you can't visually inspect it to see what's going on. For example, a GB or two into the file there might be data in a CDATA section that happens to match the expression you are searching on; the resulting errors would be very hard to diagnose.

I would tackle this one using SAX. SAX can often be difficult, but here there is very little state to maintain: one bit to say "the last event was an Iteration_query-def start tag", one bit to say "copying is switched on/off". You write an XMLFilter implementation that switches copying on for a text node that matches one of your keywords, provided the last event was an Iteration_query-def start tag, and that switches copying off when it sees an Iteration start tag; every other event is simply copied to the result if copying is switched on.

Michael Kay
  • Perhaps give an example or a starting point of how to parse the xml in the question using SAX – bdrx Mar 29 '14 at 02:22
  • Sorry, no time. I never spend more than five minutes answering a question, unless it's about my own product. – Michael Kay Mar 29 '14 at 13:15
  • Thanks for pointing at SAX. It doesn't look like the clearest approach, but I'll look into it. Thanks anyway for your five minutes ;) – Sara Mar 29 '14 at 14:56

You will want to process the entire file just once instead of once for each hit you are searching for. You can use awk, python, perl, etc., but you want to read your list of hits just once and read your large file just once.

Assuming the iteration objects cannot be nested, using perl you can do something like this:

#!/usr/bin/perl -w
use strict;
use English;
use IO::Handle;

sub run
{
   my $outputFile = 'output.xml';
   my $largeFile  = 'largeInput.xml';

   open(my $hitsfile, '<', 'hits.txt') or die "could not open hits.txt: $!";
   my @hits = <$hitsfile>;
   close($hitsfile) or warn "error closing hits.txt: $!";
   chomp(@hits); # strip trailing newlines so the names can match mid-line
   my $hitsRegex = buildHitsRegex(@hits);

   my $iterationStopRegex = qr/<Iteration>/;

   open(my $infile, '<', $largeFile) or die "could not open $largeFile: $!";
   open(my $outfile, '>', $outputFile) or die "error opening $outputFile for writing: $!";
   $outfile->autoflush; # see output in the file as it is written, not just after closing the handle

   my $printlines = 0;
   while(my $line = <$infile>)
   {
      if(!$printlines && $line =~ $hitsRegex)
      {
         $printlines = 1;
      }
      if($printlines)
      {
         print $outfile $line;
      }
      if($printlines && $line =~ $iterationStopRegex)
      {
         $printlines = 0;
      }
   }
   close($infile);
   close($outfile);
}

sub buildHitsRegex
{
   my @hits = @ARG;
   my $firstHit = shift(@hits);
   my $hitsRegexStr = '<Iteration_query-def>(?:' . $firstHit;
   for my $hit (@hits)
   {
      $hitsRegexStr .= "|$hit";
   }
   $hitsRegexStr .= ')';
   return qr/$hitsRegexStr/;
}

# Earlier per-line alternative, kept for reference; superseded by the
# combined regex built above.
sub matchesHit
{
   my ($line, $hits) = @ARG;
   my $iterationStartRegex = qr/<Iteration_query-def>/;
   for my $hit (@{$hits})
   {
      if($line =~ /$iterationStartRegex$hit/)
      {
         return 1;
      }
   }
   return 0;
}

run();

If you are only trying to match each hit once you could also remove the hit from the list or regex after matching it.

If <Iteration_query-def> is always at the beginning of the line, then you can also optimize some by adding ^ to the beginning of the regex, indicating the line must begin with <Iteration_query-def>. Such as

my $hitsRegexStr = '^<Iteration_query-def>(?:' . $firstHit;

The same applies for <Iteration>

my $iterationStopRegex = qr/^<Iteration>/;

If <Iteration> is always on its own line you can also add a $ to match the end of line.

my $iterationStopRegex = qr/^<Iteration>$/;
bdrx
  • Thanks, I tried to run the code, but after 5 minutes running, output.xml is empty. I only had 2 warnings: `Name "main::printLines" used only once: possible typo at parser.pl line 32.` and `Name "main::ARG" used only once: possible typo at parser.pl line 43` – Sara Mar 29 '14 at 15:19
  • There was a typo at line 32 where printLines was used instead of printlines with lower case l. I moved use strict; use English to a separate line to make it more clear. You need to have those enabled. Having strict enabled would have caught the error(s). use English enables the use of @ARG instead of @_; – bdrx Mar 31 '14 at 19:19
  • If you want to see the output in the file as it writes and not just at the end, then you will need to flush the output. See http://stackoverflow.com/questions/4538767/how-flush-a-file-in-perl – bdrx Mar 31 '14 at 19:35
  • Thanks, the code works but it is still too slow for my gigantic xml file. – Sara Apr 02 '14 at 16:30
  • You can try combining your hits into one regex at the beginning and then match on that. If you are searching for a large number of hits then it may speed up the performance I will update the code with an example – bdrx Apr 02 '14 at 17:54
  • Did you try any of the suggested optimizations and if so did they make any difference? – bdrx Apr 07 '14 at 17:24

Assuming there is whitespace following <Iteration_query-def>objectxxx, then try this:

gawk '
    NR==FNR {name[$1]; next}
    ENDFILE {RS="<Iteration_query-def>"}
    $1 in name {print RS $0}
' hits.txt large.xml

Requires GNU awk version 4. The first block loads the names from hits.txt; ENDFILE then switches the record separator, so every subsequent record of large.xml begins with the object name as its first field, and matching records are printed with the delimiter restored.

glenn jackman
  • Thanks Glenn, I do have a space. I ran gawk but nothing happens, it just stops after running 5 minutes. – Sara Mar 29 '14 at 14:52

I would check out perl's XML::Twig. It's initialized with a set of xpath expressions and corresponding subroutine refs to call for each xpath match. Here is an example:

#!/usr/bin/perl -w

use strict;
use XML::Twig;

my $hits_path = shift;
my $xml_path = shift;

open(my $hits_fh, '<', $hits_path) or die "could not open $hits_path: $!";
chomp(my @hits = <$hits_fh>);
close($hits_fh);

my $twig = XML::Twig->new(TwigHandlers => {
    '//Iteration_query-def' => \&process_query_def
});

print "<?xml version=\"1.0\"?>\n<smaller>\n";
# parse the xml
if (defined $xml_path) {
    $twig->parsefile($xml_path);
} else {
    # if no path, parse stdin
    $twig->parse(\*STDIN);
}
print "</smaller>\n";

sub process_query_def {
    my ($tree, $elem) = @_;
    my $text = $elem->text;
    my $first_word = $text;
    $first_word =~ s/\s*([^\s]+).*/$1/s;
    # exact match so e.g. object1 does not also match object130
    if (grep { $_ eq $first_word } @hits) {
        print "<Iteration_query-def>$text</Iteration_query-def>\n";
    }
}

Sample usage:

genomics.pl ~/tmp/hits.txt ~/tmp/genomics.xml > ~/tmp/smaller.xml
Cole Tierney