Perl and Mechanize: can't get multiple matches for my simple regex

Question

I'm using WWW::Mechanize to query the Twitter API and storing the (XML) results in @content.

Now I want to search through that content for user IDs (the data I want is always stored between <id>...</id> tags). The following regex works perfectly on a downloaded file:

for ( @content ) {
  if (m/<id>(\d+)<\/id>/) {
    print "$1\n";
  }
}

but it doesn't work on the @content array that I create with Mechanize: there it gives me only a single match.

I've tried using the lookaround (lookbehind/lookahead) method that I found elsewhere on StackOverflow, but that seems to have been a red herring:

m/(?<=<id>)(\d{1,})(?=<\/id>)/g

I'm missing something, but (after years of always finding the answer on StackOverflow or elsewhere) I'm stumped. Clearly I don't even know how to ask the correct question. What am I missing? Is it something to do with the way that Mechanize stores the array?

[The pony he comes...](http://stackoverflow.com/a/1732454/554546) — JackManey, May 25 '12 at 19:22
Aside from what @JackManey said, have a look at https://metacpan.org/module/Net::Twitter. It will do the heavy lifting for you. — oalders, May 25 '12 at 19:27
While I am amused by @JackManey's response (am I really contributing to the moral collapse of SO with my question?) I don't think that it addresses my point entirely. I can solve the problem in all sorts of ways -- but none of these will make me wiser about why two (to me) identical arrays (if I `print Dumper(@content);` they seem to be identical anyway) don't work with the same regex. What -- to repeat my plaintive question -- am I missing? Why does the Mechanize content behave differently to the downloaded content? — mediaczar, May 25 '12 at 19:47

daxim · Answer 1 · score 3 · 2012-05-25T21:37:06.020

use 5.010;
use strictures;
use WWW::Mechanize qw();
use XML::LibXML qw();

my $mech = WWW::Mechanize->new;
$mech->get('http://api.twitter.com/1/followers/ids/twitter.xml');
my $dom = XML::LibXML->load_xml(string => $mech->content);

# or skip the middle-man:
# my $dom = XML::LibXML->load_xml(location => 'http://api.twitter.com/1/followers/ids/twitter.xml');

say $_->textContent for $dom->findnodes('//id');

edited May 25 '12 at 21:37

answered May 25 '12 at 21:31

daxim


This was really v. useful: many thanks. I've been using XML::Simple -- looks like I could really do a lot more in less time with this... – mediaczar May 27 '12 at 17:42

score 0 · Answer 2 · answered May 25 '12 at 19:30


For XML you should use an XML parser. What if your XML looks like this?

<id param="test">
4
</id>
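As a minimal sketch of the difference, the following compares the regex from the question with XML::LibXML (the parser used in another answer here) on input shaped like the snippet above. The sample document and the variable names are made up for illustration; only the `<id>` element comes from the question.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample: one <id> with an attribute and line breaks,
# one <id> in the plain form the question expects.
my $xml = qq{<ids>\n<id param="test">\n4\n</id>\n<id>5</id>\n</ids>};

# The regex from the question only sees the plain form.
my @by_regex = $xml =~ m/<id>(\d+)<\/id>/g;

# The parser sees both elements; trim the whitespace around the text.
my $dom = XML::LibXML->load_xml(string => $xml);
my @by_parser = map {
    my $text = $_->textContent;
    $text =~ s/^\s+|\s+$//g;
    $text;
} $dom->findnodes('//id');

print "regex:  @by_regex\n";   # regex:  5
print "parser: @by_parser\n";  # parser: 4 5
```

The regex silently skips the first element, while the parser does not care about attributes or whitespace inside the tag.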

You should also dump your @content to check that it contains what you expect.
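One way to do that check is sketched below, assuming the array may hold either the whole document in one element or one line per element. `$body` is a made-up stand-in for the real response. Passing a reference to Dumper, and printing the element count, makes the structure of the array visible, which `print Dumper(@content)` can hide.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical stand-in for the response body.
my $body = "<id>1</id>\n<id>2</id>\n<id>3</id>\n";

my @one_element = ($body);            # whole document in a single element
my @per_line    = split /\n/, $body;  # one element per line

# The element count shows the difference immediately.
print scalar(@one_element), " element(s)\n";  # 1 element(s)
print scalar(@per_line),    " element(s)\n";  # 3 element(s)

# Dumping a reference shows the array structure; Useqq prints
# embedded newlines as \n instead of as literal line breaks.
local $Data::Dumper::Useqq = 1;
print Dumper(\@one_element);
print Dumper(\@per_line);
```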


Kostia Shiian


Sometimes regexes can handle quick-and-dirty simple tasks on X/HTML when you're in languages where you really have to work to find a good parser... but in Perl the parsers are clearly easier to use than the regexes themselves. – djechlin May 25 '12 at 19:47
I've edited the question to make it clear that the XML will always be the same. I'm aware that I can use XML::Simple but thank you nonetheless. My question remains, though: why doesn't the Mechanize array behave like the array that I create by reading in the same file from a local download? – mediaczar May 25 '12 at 19:53
@djechlin thanks. That's probably true. This started off as a single line of sloppy shell: `curl http://api.twitter.com/1/followers/ids/twitter.xml | sed 's/<[^>]*>//g' | sed '/^$/d'` and I was unwilling to work much harder. But it's thrown up (for me) an interesting question that I don't really understand. – mediaczar May 25 '12 at 19:56

score -1 · Accepted Answer · answered May 25 '12 at 20:11


Try this:

use strict;
use warnings;
use WWW::Mechanize;
use Data::Dumper;

my $mech = WWW::Mechanize->new();

my $url = "http://api.twitter.com/1/followers/ids/twitter.xml";

$mech->get( $url );

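# $mech->content returns the whole response body as one string,
# so split it into lines before matching line by line

#print Dumper ($mech->content);

my @data = split /\n/ , $mech->content;

foreach (@data)
{
    # a typical element, as shown by Dumper: $VAR4987 = '<id>340750222</id>';
    if (/<id>(\d+)<\/id>/)
    {
        print $1; print "\n";
    }
}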


Kostia Shiian


`split /\n/ , $mech->content;` is exactly what I needed. I am a gimp; I was effectively creating a single item array (with the whole file as the item.) Many thanks! – mediaczar May 25 '12 at 21:43
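For later readers, here is a small sketch of why the original loop printed only one ID, with a made-up string standing in for `$mech->content`. A match without `/g` inside `if` succeeds at most once per array element, so a one-element array gives one match. Adding `/g` inside `if` does not help either, because in scalar context it still finds only one match per evaluation; it needs a `while` loop or list context.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $mech->content: one string, many lines.
my $content = "<ids>\n<id>11</id>\n<id>22</id>\n<id>33</id>\n</ids>\n";

# 1. The loop from the question over a one-element array:
#    the match runs once, so only the first ID is found.
my @once;
for ($content) {
    push @once, $1 if m/<id>(\d+)<\/id>/;
}

# 2. Splitting on newlines gives the loop one element per line.
my @via_split;
for (split /\n/, $content) {
    push @via_split, $1 if m/<id>(\d+)<\/id>/;
}

# 3. Alternatively, /g in list context returns every capture
#    from the single string, with no split needed.
my @via_g = $content =~ m/<id>(\d+)<\/id>/g;

print "once:      @once\n";       # once:      11
print "via split: @via_split\n";  # via split: 11 22 33
print "via /g:    @via_g\n";      # via /g:    11 22 33
```

Reading a local file with `<$fh>` in list context already yields one element per line, which is presumably why the downloaded copy behaved differently.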
