I am new to regular expressions and still learning.

I have an XML file which has a text node followed by clinical information.

In the text node, content IDs are defined like this:

<item>
    <content ID="a138134600007">Wellbutrin TABS;</content>
    <content ID="a138134600007-sta"> (Active) </content>
    <content ID="a138134600007-comments"> </content>
</item>

Later, in the XML snippet containing the actual clinical data, these IDs are referenced:

              <text>
                <reference value="#a138134600007" />
              </text>

I would like to replace the reference inside the above text node with the content identified by that ID, so that the file is transformed to look like this:

              <text>
                Wellbutrin TABS;
              </text>

Being a Java developer, I am resisting writing a really ugly solution and am looking for a more elegant regular-expression solution. Performance also matters, since the transformation needs to run on half a million XML CCDs.

I would like to do it in Perl, since it is available by default on Linux, but I am happy to use any technology that can solve this problem.

Any suggestions?

Thanks in advance, Cheers, Vipin.

  • When parsing XML, the elegant solution is usually not to use a regular expression :-) http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg?rq=1 – matt freake Dec 17 '15 at 13:48
  • It's not a good idea to parse XML as plain text; there are XML parsers for that. – Amen Jlili Dec 17 '15 at 14:00
  • We can't test a possible solution against a GIF. Include a small, complete, testable example of sample input and expected output. – Ed Morton Dec 17 '15 at 14:44
  • Please don't post links to images of code; just post the code, especially since there's so little of it. Can you please edit your question accordingly? – Bohemian Dec 17 '15 at 14:45
  • *“Being a Java developer i am resisting writing a really ugly solution”* I disagree that the two are in any way related – Borodin Dec 17 '15 at 14:53
  • Are all of the **item** tags found inside a single or different parent tag than the **text** tags? – iPherian Dec 17 '15 at 16:06
  • I ended up using DOM; it was fast enough to process half a million files in less than an hour, and that was good enough. I totally agree with people's responses that regex is not meant for this when XML libraries are available. – driftingprogrammer Apr 03 '16 at 09:59

3 Answers

I suggest looking at Java's XML Parsing. As many people said, don't use Regex to parse XML files.

You can also use xmllint (with its XPath support). I also suggest posting the sample XML file here instead of a GIF image.
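
As a rough illustration of the Java route, here is a minimal DOM sketch. It assumes the element and attribute names from the question (content/ID, reference/value) and that the document fits in memory. The class name ReferenceInliner and the method inline are made up for this example, and the sample XML in main is hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration; not part of any library.
class ReferenceInliner {

    /** Replaces each <reference value="#id"/> with the text of <content ID="id">. */
    static String inline(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Pass 1: index the text of every <content> element by its ID attribute.
        Map<String, String> contentById = new HashMap<>();
        NodeList contents = doc.getElementsByTagName("content");
        for (int i = 0; i < contents.getLength(); i++) {
            Element content = (Element) contents.item(i);
            contentById.put(content.getAttribute("ID"), content.getTextContent());
        }

        // Pass 2: swap each <reference> for a text node. The NodeList is live,
        // so walk it backwards while elements are being removed from the tree.
        NodeList refs = doc.getElementsByTagName("reference");
        for (int i = refs.getLength() - 1; i >= 0; i--) {
            Element ref = (Element) refs.item(i);
            String value = ref.getAttribute("value");
            String id = value.startsWith("#") ? value.substring(1) : value;
            String text = contentById.get(id);
            if (text != null) { // unresolved references are left untouched
                ref.getParentNode().replaceChild(doc.createTextNode(text), ref);
            }
        }

        // Serialize the modified tree back to a string.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample input modelled on the snippets in the question.
        String xml = "<doc><item><content ID=\"a1234\">Wellbutrin TABS;</content></item>"
                   + "<text><reference value=\"#a1234\" /></text></doc>";
        System.out.println(inline(xml));
    }
}
```

For half a million files, the same method could be called once per file; whether that is fast enough depends on file size and hardware.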

NinjaGaiden

Because the questioner requested it: With some assumptions, a simple regex can do it.

Assuming the file is free of XML syntax errors, that <content> tags are only found inside <item> tags, that whitespace and attribute ordering are consistent throughout the XML (i.e. it is autogenerated), that each <text> tag and its children span exactly three lines, and that the XML looks like the example in the question:

Item/content nodes

          <item>
            <content ID="a1234"> text </content>
            <!-- more -->
          </item>

Text node

          <text>
            <reference value="#a1234" />
          </text>

Perl code:

The code replaces <reference> tags as described in the question. All other tags are handled and printed out unchanged.

Regex for item/content tags: /<content ID="(.*?)">(.*?)<\/content>/

Regex for text/reference tags: s/(<text>\s*)<reference value="#(.*?)" \/>(\s*<\/text>)/$1.$content{$2}.$3/es

The second regex, which is doing the replacement, grabs values from the %content hash, which is populated earlier.

my %content;

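## open filehandles called XIN, XOUT
## (assumption: input and output paths are passed on the command line)
open(XIN,  '<', $ARGV[0]) or die "Cannot read $ARGV[0]: $!";
open(XOUT, '>', $ARGV[1]) or die "Cannot write $ARGV[1]: $!";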

## stores 3 lines from file, used by second loop
my @block;

while (<XIN>) {
  if (/<content ID="(.*?)">(.*?)<\/content>/) {
    my ($id, $text) = ($1, $2);
    $content{$id} = $text;
  } elsif (/<text>/) {
    ## keep this line for next loop
    push @block, $_;
    ## when we start seeing <text> tags, go to next loop for these
    last;
  }
  print XOUT $_;
}

while (1) {
  ## read up to 3 lines into @block
  for (scalar(@block)+1..3) { my $l = <XIN>; last if (!defined $l); push @block, $l; }
  ## if we've read nothing, we are at EOF
  last if (scalar(@block) == 0);

  my $concat = join '', @block;
  if ( ($concat =~ s/(<text>\s*)<reference value="#(.*?)" \/>(\s*<\/text>)/$1.$content{$2}.$3/es) > 0) {
      print XOUT $concat;
      @block = ();
  } else {
      print XOUT shift @block;
  }
}

Otherwise, just use an XML parser. There are a lot of CPAN modules for it. I like XML::Parser. It doesn't need to load the entire file into memory.

complete perl script

hypothetical input xml

output xml

P.S. The one thing that might not be appropriate to assume is that <content> tags are only found inside <item> tags. But it's a simple change. Will update if OP provides details.

P.P.S. The regex is simple ;). The logic is moderately long. If the rest of the input XML, including tags not specifically mentioned, didn't need to be preserved, the code would be simpler.
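
For comparison, here is a rough Java sketch of the same two-regex idea, under the same assumptions about the input as above. It is not a port of the Perl script: it works on the whole document as one string instead of streaming three-line blocks, so IDs are collected before any replacement happens. The class name RegexInliner is made up for this example; Pattern.DOTALL stands in for Perl's /s flag.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class for illustration; the two patterns mirror the ones quoted above.
class RegexInliner {
    private static final Pattern CONTENT =
            Pattern.compile("<content ID=\"(.*?)\">(.*?)</content>");
    // DOTALL lets \s* and .*? cross line breaks, like Perl's /s.
    private static final Pattern TEXT_REF = Pattern.compile(
            "(<text>\\s*)<reference value=\"#(.*?)\" />(\\s*</text>)", Pattern.DOTALL);

    static String inline(String xml) {
        // Pass 1: collect ID -> text from every <content> tag.
        Map<String, String> content = new HashMap<>();
        Matcher c = CONTENT.matcher(xml);
        while (c.find()) {
            content.put(c.group(1), c.group(2));
        }

        // Pass 2: rewrite each <text><reference .../></text> block.
        Matcher r = TEXT_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (r.find()) {
            String text = content.get(r.group(2));
            // Unknown IDs keep the original block unchanged.
            String replacement = (text == null)
                    ? r.group()
                    : r.group(1) + text + r.group(3);
            // quoteReplacement guards against '$' and '\' in the content text.
            r.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        r.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample input modelled on the snippets above.
        String xml = "<item>\n  <content ID=\"a1234\">Wellbutrin TABS;</content>\n</item>\n"
                   + "<text>\n  <reference value=\"#a1234\" />\n</text>\n";
        System.out.print(inline(xml));
    }
}
```

The same caveats apply as for the Perl version: any variation in whitespace or attribute order inside the tags will make the patterns miss.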

iPherian
  • Thanks so much. I ended up using Java DOM, as everyone suggested, but thanks a lot for providing the answer; I am definitely using it as a learning tool. – driftingprogrammer Apr 03 '16 at 10:00

You can achieve the same easily with XML::LibXML, and much more reliably than with a regular expression, which can hardly handle special characters, escape sequences, newlines and the like:

use XML::LibXML;

my $doc = XML::LibXML->load_xml(IO => \*STDIN); # or stream or file..
# yourLookupMethod() is a placeholder: it should map a value such as "#a1234" to its content text
foreach my $node ($doc->documentElement()->findnodes("/path/to/your/element/text/reference")) {
    $node->parentNode()->appendText(yourLookupMethod($node->getAttribute("value")));
    $node->unbindNode();
}
$doc->toFH(\*STDOUT, 0); # or stream or file...
Zbynek Vyskovsky - kvr000