I have recently been tasked with writing a script that creates resource Data Capture Records (DCRs) from an online XML feed.

This is not something that I have done before, and I would be grateful if anyone could offer any key points that I should be aware of, any background reading that I could look at, or any other issues or 'pitfalls' that I should take into consideration. Terminology that is specific to this type of task would also be a big help.

Ideally I would like to achieve this using jQuery or, if that would make the task easier, Perl. My jQuery knowledge is better than my Perl knowledge, though.

My aim is to take a very large online XML feed that comprises many elements with a variety of content. An example of the XML is below.

<response>
<result name="response" numFound="3559" start="0">
    <doc>
        <str name="PID">islandora:4466</str>
        <arr name="dc.coverage">
            <str>4466</str>
        </arr>
        <arr name="dc.description">
            <str>
                Text
            </str>
            <str>
                <p><iframe src="http:" width="230" height="230" frameborder="0" allowtransparency="65535" scrolling="auto"></iframe></p>
                <p><a href="/assets/.....">Transcript (DOC, 150KB)  </a></p>
            </str>
        </arr>
        <arr name="dc.identifier">
            <str>islandora:4466</str>
        </arr>
        <arr name="dc.subject">
            <str>heav422</str>
            <str>heav533</str>
            <str>heav547</str>
            <str>heav549</str>
            <str>discipline1137</str>
            <str>theme778</str>
        </arr>
        <str name="dc.title">Text</str>
        <arr name="hea.abstract">
            <str> <!-- HTML ready content (example below) -->
                <p>Text</p>
                    <ul>
                        <li>Text</li>
                        <li>Text</li>
                        <li>Text</li>
                        <li>Text</li>
                        <li>Text</li>
                        <li>Text</li>
                        <li>Text</li>
                    </ul>
                <p>Text</p>
            </str>
        </arr>
        <arr name="hea.date">
            <str>2012-05-01 00:00:00</str>
        </arr>
        <arr name="hea.discipline">
            <str>1137</str>
        </arr>
        <arr name="hea.heav">
            <str>422</str>
            <str>533</str>
            <str>547</str>
            <str>549</str>
        </arr>
        <str name="hea.resource_type">808</str>
        <arr name="hea.theme">
            <str>778</str>
        </arr>
        <arr name="hea.title">
            <str>Text</str>
        </arr>
        <date name="timestamp">2013-11-07T08:12:22.684Z</date>
    </doc>
</result>
</response>

Ideally I would like to develop something that would allow me to break the initial large XML into individual XML files for use as data capture records.

My initial thinking is that I could use jQuery's $.parseXML to separate the initial XML into individual records and then save each one as an individual .xml file, before putting them into my work CMS and converting them to DCRs (using the functionality of the CMS).

I have looked around online and there seem to be many more complicated ways of doing this, so I would be grateful for any guidance on how best to approach it.

This is the first time I will have attempted anything like this, and I have a deadline that takes this into account. If anyone could suggest any hints, tips or extra reading, I would appreciate it. This is my initial research stage, so I have not yet started trying to put together a solution.

If I have missed anything that you would like to know to better advise, please ask and I will endeavor to post the answer ASAP.

Thank you for having a look and for any advice that is given.

Edit: I am curious to know why this has been marked down without any comment explaining why.

Dan

  • You should show us code to look at. What did you do so far? – user1126070 Jan 09 '14 at 14:07
  • @user1126070, hi, thanks for having a look. I have not started trying to create a solution yet, as I have only just been tasked with the problem. I am still trying to work out the best approach to take, and the logic behind it, for completing a task like this. – sirBassetDeHound Jan 09 '14 at 14:11
  • If you gave us a bit more info on the inputs and expected outputs of what you need to write, we could offer you practical advice. Without it there is not much we can do to help you. – mirod Jan 09 '14 at 14:21
  • @mirod, thanks for the advice. I have updated the initial question. Let me know if there is anything specific which you feel should be mentioned that I have omitted and I will update again. – sirBassetDeHound Jan 09 '14 at 14:54

2 Answers

You could use xml_split which is part of XML::Twig to do this. If the tool doesn't do what you want, you can use XML::Twig itself to break up the original file the way you want. The module is designed to handle big files.

Another Perl solution is to use XML::LibXML, especially the reader interface in XML::LibXML::Reader.
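As a rough illustration of the reader interface, here is a minimal sketch that writes each <doc> element of the feed to its own file. The helper name split_docs, the file names and the numbering scheme are made up for this example; it also assumes the records are the <doc> elements shown in the question and that the feed is well-formed XML.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical helper (not part of XML::LibXML): writes every <doc>
# element found in $source to "$prefix-NNNN.xml" and returns the names.
# $source is a file path or URL that libxml2 is able to open.
sub split_docs {
    my ($source, $prefix) = @_;
    my $reader = XML::LibXML::Reader->new(location => $source)
        or die "cannot read $source\n";

    my @files;
    # nextElement() moves the pull parser forward to the next <doc>
    # start tag, so the whole feed is never loaded as one tree.
    while ($reader->nextElement('doc')) {
        my $filename = sprintf '%s-%04d.xml', $prefix, scalar(@files) + 1;
        open my $fh, '>:encoding(UTF-8)', $filename
            or die "cannot write $filename: $!";
        # readOuterXml() serialises the current element with its content.
        print {$fh} qq{<?xml version="1.0" encoding="UTF-8"?>\n},
                    $reader->readOuterXml, "\n";
        close $fh or die "cannot close $filename: $!";
        push @files, $filename;
    }
    return @files;
}
```

Calling split_docs('feed.xml', 'record') (both names are placeholders) would then produce record-0001.xml, record-0002.xml and so on, one per <doc>. Whether those files are directly usable as DCRs depends on what the CMS expects, which the question does not say.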

mirod

For large files I recommend stream-style parsing: you are only interested in some of the tags, and the file may be too large to fit into memory.

Here is some reading: http://coldattic.info/shvedsky/pro/blogs/a-foo-walks-into-a-bar/posts/55

CPAN module: http://metacpan.org/pod/XML::Twig

Example:

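use strict;
use warnings;
use XML::Twig;

# Single-quoted heredoc, so any $ or @ in the XML is not interpolated.
my $xml = <<'ENDOFXML';
... your xml here ...

ENDOFXML

my $index = 0;
my $t = XML::Twig->new(
    twig_roots    => { doc => 1 },                # only build the <doc> subtrees
    twig_handlers => { doc => \&print_n_purge },  # called once per complete <doc>
    pretty_print  => 'indented',
);
$t->parse($xml);

# Write the current <doc> element to its own file, then free its memory.
sub print_n_purge {
    my ($t, $elt) = @_;
    $index++;
    my $filename = "out-$index.xml";
    open(my $fh, '>', $filename) or die "Cannot open $filename: $!";
    # Print only this element; flush() would also emit the still-open
    # ancestor start tags into the first file.
    $elt->print($fh);
    close($fh);
    $t->purge;
    print "created $filename\n";
}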
szabgab
user1126070