
XML File :

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE companies>
<companies>
<company>
<ticker>IBN</ticker>
<title>ICICI Bank Ltd</title>
<address>ICICI Bank Ltd.ICICI Bank TowersBandra-kurla Complex,  Mumbai</address>
<phonenum> 91 22 2653 6157</phonenum>
<faxnum> 91 22 2653 1175</faxnum>
<full_time> </full_time>
<website>http://www.icicibank.com</website>
<sector>Financial</sector>
<industry>Foreign Regional Banks</industry>
<news>Headlines Financial Blogs Company Events Message Board</news>
<sno>0</sno>
<fin_ticker>IBN</fin_ticker>
<marketcap>24.52B</marketcap>
<e_value>24.52B</e_value>
<ret_on_assets>0.74%</ret_on_assets>
<gross_profit>8.94B</gross_profit>
<prof_margin>10.79%</prof_margin>
<last_trade>44.05</last_trade>
<trade_time>Apr 8</trade_time>
<prev_close>44.52</prev_close>
<serialno>0</serialno>
<mgt_ticker>IBN</mgt_ticker>
</company>
<company> ... </company>
<company> ... </company>
<company> ... </company>
<company> ... </company>
</companies>

Perl Code :

use strict;
use warnings;
use XML::Simple;
use Data::Dumper;

my $xmlfile = "sample1.xml";
my $xml = new XML::Simple;
my $data = $xml->XMLin($xmlfile);

#print Dumper($data);
print "$data->{company}{title}\n";

Expected Output : ICICI Bank Ltd

simbabque
Gourav

  • When you say "XML File : IBN ICICI Bank Ltd ...", are you saying those are the contents of the file you're trying to parse? That definitely isn't valid XML syntax. XML is structured data with a lot of '<' and '>' signs. See example [here](https://msdn.microsoft.com/en-us/library/ms762271(v=vs.85).aspx). – sferencik Apr 05 '16 at 07:19
  • @sferencik the formatting was broken, there was an empty line missing. I fixed it. Feel free to submit an edit suggestion next time you see something like this. :) – simbabque Apr 05 '16 at 07:24
  • What does your Data::Dumper output say? It looks like you are missing the `` in your `print`. – simbabque Apr 05 '16 at 07:25
  • Have you read STATUS OF THIS MODULE in [XML::Simple](http://p3rl.org/XML::Simple)'s documentation? – choroba Apr 05 '16 at 07:51
  • I added an explanation of why your XML::Simple code wouldn't work--if you care to know – 7stud Apr 06 '16 at 04:40
  • I've written a tutorial called [XML::LibXML by example](http://grantm.github.io/perl-libxml-by-example/) which is intended to help people switch from XML::Simple to XML::LibXML - which is a much better module and simpler to use. – Grant McLean Apr 06 '16 at 21:08

2 Answers


XML::Simple

STATUS OF THIS MODULE

The use of this module in new code is discouraged.

In particular, XML::LibXML is highly recommended and XML::Twig is an excellent alternative.

http://search.cpan.org/~grantm/XML-Simple-2.22/lib/XML/Simple.pm


In any case, the problem with your XML::Simple attempt:

$data->{company}{title}

is that $data->{company} returns an array reference:

use strict;
use warnings; 
use 5.020;
use XML::Simple;
use Data::Dumper;

my $xmlfile = 'xml.xml';
my $href = XMLin($xmlfile);
say Dumper($href);

--output:--
$VAR1 = {
          'company' => [   #<== That means array reference!
                       {
                         'industry' => 'Foreign Regional Banks',
                         'phonenum' => ' 91 22 2653 6157',
                         'trade_time' => 'Apr 8',
                         'ret_on_assets' => '0.74%',
                         'faxnum' => ' 91 22 2653 1175',
                         'website' => 'http://www.icicibank.com',
                         'serialno' => '0',
                         'mgt_ticker' => 'IBN',
                         'title' => 'ICICI Bank Ltd',

                 ...
                 ...

and you cannot access arrays with {...}, like you did:

     array
       |
+--------------+                 
|              |
$data->{company}{title}

Instead you have to access arrays with [...]. The first element of the array is the hash reference, so the hash is at index 0 in the array:

       hash
        |
+-----------------+                
|                 |
$data->{company}[0]

Now, you can use hash access {...} on that hash to get the title:

       hash
        |
+-----------------+                
|                 |
$data->{company}[0]{title}


use strict;
use warnings; 
use 5.020;
use XML::Simple;
use Data::Dumper;

my $xmlfile = 'xml.xml';
my $href = XMLin($xmlfile);
say "$href->{company}[0]{title}";

--output:--
ICICI Bank Ltd
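
One thing to watch out for: XML::Simple only builds an array reference when an element repeats. If the file ever contains a single <company>, the same key holds a plain hash reference and `[0]` stops working. A minimal sketch of pinning the shape down with the ForceArray and KeyAttr options follows; the inline XML is a cut-down stand-in for the question's file, not the real data:

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# Cut-down stand-in for the question's file, with only ONE <company>.
my $one_company = <<'XML';
<companies>
  <company><ticker>IBN</ticker><title>ICICI Bank Ltd</title></company>
</companies>
XML

# Default behaviour: a lone <company> collapses to a hash reference.
my $collapsed = XMLin($one_company);
print ref( $collapsed->{company} ), "\n";

# ForceArray makes <company> an array reference no matter how many there are;
# KeyAttr => [] switches off the name/key/id folding.
my $forced = XMLin( $one_company, ForceArray => ['company'], KeyAttr => [] );
print ref( $forced->{company} ), "\n";

# With the shape fixed, looping over every company is safe.
print "$_->{title}\n" for @{ $forced->{company} };
```

With the options set, `$forced->{company}[0]{title}` works whether the file has one company or fifty.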

Here it is with XML::LibXML:

1) Using DOM methods:

use strict;
use warnings; 
use 5.020;
use XML::LibXML;
use Data::Dumper;

my $xmlfile = "xml.xml";
my $parser = XML::LibXML->new();
my $doc = $parser->parse_file($xmlfile);
#say $doc;  #outputs the xml

my $root = $doc->getDocumentElement; #=> <companies> tag
my @company_tags = $root->getElementsByTagName('company');
my @title_tags = $company_tags[0]->getElementsByTagName('title');
say $title_tags[0]->textContent();

--output:--
ICICI Bank Ltd

2) Using XPaths:

use strict;
use warnings; 
use 5.020;
use XML::LibXML;
use Data::Dumper;

my $xmlfile = "xml.xml";
my $parser = XML::LibXML->new();
my $doc = $parser->parse_file($xmlfile);
#say $doc;  #outputs the xml

my $root = $doc->getDocumentElement; #=> <companies> tag
my @titles = $root->findnodes("//company/title");
say $titles[0]->findnodes("./text()");

--output:--
ICICI Bank Ltd

The methods:

  1. findnodes()
  2. find()
  3. findvalue()

are documented in the XML::LibXML::Node docs.
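
As a rough illustration of how findvalue() and findnodes() differ, here is a small sketch. It parses an inline string with load_xml() so it runs without a file; the second company (XYZ / Example Corp) is made up purely for the example:

```perl
use strict;
use warnings;
use XML::LibXML;

# Inline sample mirroring the question's structure.
# The XYZ entry is hypothetical, added only to have two companies.
my $xml = <<'XML';
<companies>
  <company><ticker>IBN</ticker><title>ICICI Bank Ltd</title></company>
  <company><ticker>XYZ</ticker><title>Example Corp</title></company>
</companies>
XML

my $doc = XML::LibXML->load_xml( string => $xml );

# findvalue() evaluates the XPath and returns its string value directly.
my $first_title = $doc->findvalue('/companies/company[1]/title');
print "$first_title\n";

# findnodes() returns the matching nodes, so each <company> can be
# queried again with a path relative to that node.
my %title_for;
for my $company ( $doc->findnodes('//company') ) {
    $title_for{ $company->findvalue('./ticker') } = $company->findvalue('./title');
}
print "$_: $title_for{$_}\n" for sort keys %title_for;
```

findvalue() is usually the shortest route when all you want is the text; findnodes() is the one to reach for when you need to keep working with the elements.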

7stud

Please don't use XML::Simple. It lies - it's not Simple at all.

I like XML::Twig as an alternative:

use XML::Twig; 
print $_ -> text,"\n" for XML::Twig -> parsefile ('sample1.xml') -> get_xpath('//company/title');

Will do the trick.

Expanding it out for the sake of clarity:

#!/usr/bin/env perl
use strict;
use warnings;

use XML::Twig; 

my $twig = XML::Twig -> parsefile ( 'sample1.xml' );

foreach my $company ( $twig -> get_xpath('//company') ) {
    print $company -> first_child('title') -> text,"\n";
}

One of the key advantages of XML::Twig and XML::LibXML is that they support XPath, which is sort of like a regular expression for XML.

That means you can select your company title by specifying either of:

//company/title
/companies/company/title

// is a wildcard meaning 'anywhere in the document'. You can also use .// for 'anywhere beneath this element', so something like:

print $company -> get_xpath('.//title',0)->text,"\n";

etc.
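
To see the three path styles side by side, here is a small sketch. It parses an inline string rather than sample1.xml so it is self-contained, and the second company (XYZ / Example Corp) is invented for the example. Bear in mind XML::Twig only implements a subset of XPath, so treat this as covering just these simple forms:

```perl
use strict;
use warnings;
use XML::Twig;

# Inline sample mirroring the question's structure.
# The XYZ entry is hypothetical, added only to have two companies.
my $xml = <<'XML';
<companies>
  <company><ticker>IBN</ticker><title>ICICI Bank Ltd</title></company>
  <company><ticker>XYZ</ticker><title>Example Corp</title></company>
</companies>
XML

my $twig = XML::Twig->new->parse($xml);

# '//' matches anywhere in the document.
my @anywhere = map { $_->text } $twig->get_xpath('//company/title');

# The absolute path spells out every step from the root.
my @absolute = map { $_->text } $twig->get_xpath('/companies/company/title');

# './/' is relative: anywhere beneath the element it is called on.
my ($first_company) = $twig->get_xpath('//company');
my $relative = $first_company->get_xpath( './/title', 0 )->text;

print "anywhere: @anywhere\n";
print "absolute: @absolute\n";
print "relative: $relative\n";
```

On a document this flat the first two select the same elements; the absolute form just stops you accidentally matching a <title> nested somewhere you didn't expect.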

Sobrique