Use grep to extract the value from the file

Question

I have a file with the following content:

<rdf:RDF
    xmlns:rdf="/www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="/xmlns.com/foaf/0.1/"
    xmlns:jfs="//abc.net/xmlns/prod/xyz/jfs/1.0/">
  <rdf:Description rdf:about="//alm.com/abc/users/piku">
    <foaf:mbox rdf:resource="mailto:piku@disney.com"/>
    <foaf:nick>piku</foaf:nick>
    <foaf:name>Pallavi Mishra</foaf:name>
    <jfs:archived rdf:datatype="//www.w3.org/2001/XMLSchema#boolean"
    >false</jfs:archived>
    <rdf:type rdf:resource="//xmlns.com/foaf/0.1/Person"/>
  </rdf:Description>
</rdf:RDF>

How can I extract the email ID 'piku@disney.com' and the name 'Pallavi Mishra' from this file using Perl or grep?

My piece of code is:

my $Name = `cat abc.json | perl -l -ne '/<j.0:name>(.*)<\\/j.0:name>/ and print \$1'`;
my $EmailAddress = `cat abc.json | grep mailto | awk 'BEGIN{FS="\\"|:"} {for(i=1;i<NF;i++) if(\$i ~ /@/) print \$i}'`;

Why is your XML data in a file named `abc.json`? Is that shell code or Perl code? Make up your mind! — 200_success, Jun 10 '15 at 06:59
I want to extract these two values from the file abc.json in a perl script. — user3616128, Jun 10 '15 at 07:10
possible duplicate of [Extraction of data from a simple XML file](http://stackoverflow.com/questions/2222150/extraction-of-data-from-a-simple-xml-file) — tripleee, Jun 10 '15 at 07:10
I think what @200_success means is, why do you have XML data in a file called `abc.json`? — Borodin, Jun 10 '15 at 07:11

score 3 · Answer 1 · answered Jun 10 '15 at 07:07

You should use a proper XML parser such as XML::LibXML

This short program demonstrates the idea

use strict;
use warnings;
use 5.014;  # For non-destructive substitution

use XML::LibXML;

my $doc = XML::LibXML->load_xml(IO => \*DATA);

my $desc = $doc->find('/rdf:RDF/rdf:Description')->get_node(1);
my $mbox = $desc->find('foaf:mbox/@rdf:resource')->string_value  =~ s/^mailto://ir;
my $name = $desc->find('foaf:name')->string_value;
print qq{"$name" <$mbox>\n};

__DATA__
<rdf:RDF
    xmlns:rdf="/www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="/xmlns.com/foaf/0.1/"
    xmlns:jfs="//abc.net/xmlns/prod/xyz/jfs/1.0/">
  <rdf:Description rdf:about="//alm.com/abc/users/piku">
    <foaf:mbox rdf:resource="mailto:piku@disney.com"/>
    <foaf:nick>piku</foaf:nick>
    <foaf:name>Pallavi Mishra</foaf:name>
    <jfs:archived rdf:datatype="//www.w3.org/2001/XMLSchema#boolean"
    >false</jfs:archived>
    <rdf:type rdf:resource="//xmlns.com/foaf/0.1/Person"/>
  </rdf:Description>
</rdf:RDF>

output

"Pallavi Mishra" <piku@disney.com>

score 1 · Answer 2 · answered Jun 10 '15 at 07:11

Do not try to parse XML using your own Perl string processing. That's a nasty unreliable hack.

Perl is a plenty capable language. You don't need to use shell to help Perl parse XML.

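use strict;
use warnings;
use XML::LibXML;

# Namespace URIs, written exactly as in the document's xmlns declarations
my $foaf = '/xmlns.com/foaf/0.1/';
my $rdf = '/www.w3.org/1999/02/22-rdf-syntax-ns#';

my $doc = XML::LibXML->new->load_xml(location => 'foof.xml');
# Text of the first foaf:name element
my $Name = $doc->getElementsByTagNameNS($foaf, 'name')->[0]->textContent;
# rdf:resource attribute of the first foaf:mbox element, e.g. "mailto:piku@disney.com"
my $EmailAddress = $doc->getElementsByTagNameNS($foaf, 'mbox')->[0]->getAttributeNS($rdf, 'resource');
$EmailAddress =~ s/^mailto://;  # drop the URI scheme, leaving the bare address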

200_success

I am able to fetch the name using: `perl -l -ne '/<foaf:name>(.*)<\\/foaf:name>/ and print \$1'` – user3616128 Jun 10 '15 at 07:32
I have no doubt that you are able to do so. That doesn't make it a _good_ idea. What if, for example, the document contains `&#123;`-escaped characters? XML parsers handle all of those details for you, automatically and correctly. Don't reinvent the wheel poorly. – 200_success Jun 10 '15 at 07:35
XML also allows a variety of valid ways to reformat a document without changing its meaning. This makes line- or regex-based parsing break. – Sobrique Jun 10 '15 at 09:22
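
The two objections raised in these comments can be seen with a few lines of shell. This is only a sketch: the four strings are hypothetical standalone fragments modelled on the question's `<foaf:name>` line (using the published FOAF namespace URI), and `extract_name` is an illustrative helper, not anything from the answers. All four fragments would give an XML parser the same name, "Pallavi Mishra", yet a line-oriented pattern recovers it from only the first.

```shell
#!/bin/sh
# Sketch only: hypothetical fragments modelled on the question's <foaf:name> line.

# A line-oriented extractor in the spirit of the question's one-liner.
extract_name() {
    printf '%s\n' "$1" |
        sed -n 's|.*<foaf:name[^>]*>\(.*\)</foaf:name>.*|\1|p'
}

# 1. The layout from the question: the whole element on one line.
compact='<foaf:name xmlns:foaf="http://xmlns.com/foaf/0.1/">Pallavi Mishra</foaf:name>'

# 2. Same element, with the ">" of the start tag moved to the next line
#    (the question's <jfs:archived> element is already laid out this way).
wrapped='<foaf:name xmlns:foaf="http://xmlns.com/foaf/0.1/"
    >Pallavi Mishra</foaf:name>'

# 3. Same element, with the namespace bound to a different prefix.
reprefixed='<f:name xmlns:f="http://xmlns.com/foaf/0.1/">Pallavi Mishra</f:name>'

# 4. Same text, with the space written as a numeric character reference.
escaped='<foaf:name xmlns:foaf="http://xmlns.com/foaf/0.1/">Pallavi&#32;Mishra</foaf:name>'

from_compact=$(extract_name "$compact")
from_wrapped=$(extract_name "$wrapped")
from_reprefixed=$(extract_name "$reprefixed")
from_escaped=$(extract_name "$escaped")

printf 'compact:    [%s]\n' "$from_compact"
printf 'wrapped:    [%s]\n' "$from_wrapped"
printf 'reprefixed: [%s]\n' "$from_reprefixed"
printf 'escaped:    [%s]\n' "$from_escaped"
```

Only the compact form yields `Pallavi Mishra`. The wrapped and re-prefixed forms yield nothing, and the escaped form yields the raw `Pallavi&#32;Mishra` rather than the decoded text. A parser-based approach, as in the answers here, is not affected by any of these variations.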

score 1 · Answer 3 · answered Jun 10 '15 at 07:29

With xmlstarlet:

For the name:

xmlstarlet sel -t -v /rdf:RDF/rdf:Description/foaf:name file

And for the email address:

xmlstarlet sel -t -v "/rdf:RDF/rdf:Description/foaf:mbox/@rdf:resource" file

You could add to the second one the sed statement to remove the mailto part:

xmlstarlet ... | sed 's/^mailto://g'
