
I'm trying to parse an XML file and extract the data I need from it.

For example:

<about>
    This is an XML file
    that I want to
    extract data from
</about>
<message>Hello, this is a message.</message>
<this>Blah</this>
<that>Blahh</that>
<person> 
    <name>Jack</name>
    <age>27</age>
    <email>jack@gmail.com</email>
</person>

I'm having trouble getting the content within the <about> tags.

This is what I have so far:

/(<\w*>)[\s*]?([\s*]?.*)(<\/\w*>)/m

I'm simply trying to extract the tag name and content, which is why I have the parentheses there: ($tag = $1) =~ s/[<>]//g to get the tag name, and $tagcontent = $2 to get the tag's contents. I'm using \s for the whitespace characters (space, tab, newline), and the ? because the whitespace may or may not occur, any number of times.

I tested this at http://www.regexe.com/ but had no luck getting a match.

Any help is appreciated. Thanks in advance!

lkisac
  • See: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?rq=1 – AKHolland Jun 20 '14 at 20:10

2 Answers

XML is not a regular language and cannot be accurately parsed using regular expressions. Use an XML parser instead. A parser handles nesting, attributes, comments and entities correctly, and is far less likely to break if the format of the markup changes in the future.

However, if you're absolutely sure of the format, you could get away with the following regex:

/<(\w+)>\s*(.*?)\s*<\/\1>/s

Explanation:

  • / - Starting delimiter
  • <(\w+)> - The opening tag
  • \s* - Match optional whitespace in between
  • (.*?) - Lazily match the contents inside the tag (as few characters as possible)
  • \s* - Match optional whitespace in between
  • <\/\1> - Match the closing tag. \1 here is a backreference which contains what was matched by the first capturing group.
  • / - Ending delimiter (the s that follows is the modifier that lets . match newlines)
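As a rough sketch of how that pattern might be used in Perl, the following loops over a trimmed copy of the sample markup and collects tag/content pairs. The variable names ($xml, @pairs) are illustrative and not from the question, and the g modifier is added here so that every element is found rather than only the first.

```perl
use strict;
use warnings;

# Trimmed sample markup from the question (with the <age> tag closed correctly).
my $xml = <<'XML';
<about>
    This is an XML file
    that I want to
    extract data from
</about>
<message>Hello, this is a message.</message>
<person>
    <name>Jack</name>
    <age>27</age>
</person>
XML

# Collect [tag, content] pairs. /s lets . match newlines, /g finds every match.
my @pairs;
while ($xml =~ /<(\w+)>\s*(.*?)\s*<\/\1>/sg) {
    push @pairs, [$1, $2];
}

printf "%s => %s\n", @$_ for @pairs;
```

Note that the content captured for person is the raw inner markup (the name and age elements, tags included), because the match consumes the whole outer element and does not descend into it. That is one of the limits of the regex approach.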

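Note that the s modifier and m modifier are entirely different, and do different things: s lets . match newline characters, while m makes ^ and $ match at the start and end of every line instead of only at the start and end of the string. See this answer for more information about what each does.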
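To make the difference between the two modifiers concrete, here is a small sketch on a made-up string ($text and the flag variables are illustrative names only). It tries the same multi-line match with no modifier, with s, and with m, and then shows what m is actually for.

```perl
use strict;
use warnings;

my $text = "<about>\nline one\nline two\n</about>";

# No modifier: . stops at a newline, so the multi-line element does not match.
my $plain  = $text =~ /<about>(.*)<\/about>/  ? 1 : 0;
# /s: . also matches newlines, so the element matches.
my $dotall = $text =~ /<about>(.*)<\/about>/s ? 1 : 0;
# /m: only ^ and $ change meaning; . is unaffected, so this still fails.
my $multi  = $text =~ /<about>(.*)<\/about>/m ? 1 : 0;

# What /m is for: ^ and $ anchor at the start and end of each line.
my @lines = $text =~ /^(line \w+)$/mg;

print "plain=$plain dotall=$dotall multi=$multi\n";
print "lines: @lines\n";
```

This is why the m modifier in the question's pattern does not help with the content of the about element spanning several lines.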

Regex101 Demo

Amal Murali
  • Yeah, I should be using XML::Parser, I thought it might be handy to know how to use regex for this too. Thanks for the very clear explanation! – lkisac Jun 20 '14 at 23:31
  • Glad I could help, @lkisac. Since you're new here, I'll say this: If one of the answers below fixes your issue, you should accept it (click the check mark next to the appropriate answer). That does two things. It lets everyone know your issue has been resolved, and it gives the person that helps you credit for the assist. See [this post](http://meta.stackexchange.com/a/5235/220538) for more information. – Amal Murali Jun 21 '14 at 05:45
  • Thank you. I initially tried to vote up and it wouldn't allow me to without enough reputation. Didn't know I had to click accept on my end. Thanks for the tip! – lkisac Jun 21 '14 at 14:39
  • +1, but how about a `(\S*)` in place of the `(.*?)` to appease the anti-dot-star monster? – zx81 Jun 21 '14 at 23:31

I advise against using a regular expression to parse XML; use an actual XML parser instead.

The following uses XML::LibXML to display the text in the 'about' node. XML::Twig is another excellent option.

use strict;
use warnings;

use XML::LibXML;

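# Parse the XML document that follows the __DATA__ marker
my $xml = XML::LibXML->load_xml(IO => \*DATA);

# Evaluate the XPath expression and return its string value (the text inside <about>)
my $about = $xml->findvalue('//about');

print $about, "\n";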

__DATA__
<root>
<about>
    This is an XML file
    that I want to
    extract data from
</about>
<message>Hello, this is a message.</message>
<this>Blah</this>
<that>Blahh</that>
<person> 
    <name>Jack</name>
    <age>27</age>
    <email>jack@gmail.com</email>
</person>
</root>

Outputs:

    This is an XML file
    that I want to
    extract data from
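
Since the question asks for each tag name together with its content, the same approach could be extended along these lines. This is only a sketch: it assumes the markup is wrapped in a single root element as above, and the names $doc, @pairs and $name are illustrative. It relies on the XML::LibXML methods findnodes, nodeName, textContent and findvalue.

```perl
use strict;
use warnings;

use XML::LibXML;

# Trimmed copy of the sample document, wrapped in a single <root> element.
my $doc = XML::LibXML->load_xml(string => <<'XML');
<root>
<about>
    This is an XML file
</about>
<message>Hello, this is a message.</message>
<person>
    <name>Jack</name>
    <age>27</age>
</person>
</root>
XML

# Collect [tag, text] pairs for each child element of the root.
my @pairs;
for my $node ($doc->documentElement->findnodes('./*')) {
    my $content = $node->textContent;
    $content =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    push @pairs, [$node->nodeName, $content];
}

# Nested fields can be reached with a more specific XPath expression.
my $name = $doc->findvalue('//person/name');

print "$_->[0]: $_->[1]\n" for @pairs;
print "name: $name\n";
```

For person, textContent returns the concatenated text of the nested elements, so individual fields such as name are better fetched with their own XPath expression, as in the last lookup.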
Miller
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Axeman Jun 21 '14 at 05:03