1

I'm using XML::RSSLite to parse RSS data that I retrieve with LWP. LWP retrieves the data in the right encoding, but once RSSLite parses it the encoding seems to be lost, and characters like é, è, à, etc. are deleted from the output. Is there an option I can set to force the encoding?

Here is my script:

use strict; 
use XML::RSSLite; 
use LWP::UserAgent; 
use HTTP::Headers; 
use utf8; 

my $ua = LWP::UserAgent->new; 
$ua->timeout(10); 
$ua->env_proxy; 
my $URL = "http://www.boursier.com/syndication/rss/news/FR0004031839/FR"; 
my $response = $ua->get($URL); 

if ($response->is_success) { 
   my $content = $response->decoded_content((charset => 'UTF-8')); 
   my %result; 
   parseRSS(\%result, \$content); 
   foreach my $item (@{ $result{items} }) { 
      print "ITEM: $item->{title}\n"; 
   } 
}

I tried XML::RSS, since it seems to have more options that might be handy in my case, but unfortunately it failed to install. :(

brian d foy
ehretf
  • That URL gives a 404 Not Found. – daxim Apr 23 '12 at 08:34
  • If you need help with installing modules, read [What's the easiest way to install a missing Perl module?](http://stackoverflow.com/questions/65865/whats-the-easiest-way-to-install-a-missing-perl-module) and, if still necessary, [open a new question](http://stackoverflow.com/questions/ask). – daxim Apr 23 '12 at 08:41
  • Many thanks daxim for your answer. There was an error during copy/paste; I have corrected it and the URL is now fine. – ehretf Apr 23 '12 at 10:21
  • This is not the issue here (as daxim pointed out), but note that most XML parsers require an XML document, which means you need to prevent LWP from attempting to partially parse the XML by using `$response->decoded_content(charset => 'none');` – ikegami Apr 23 '12 at 17:58

2 Answers

4

I like that Mojo::UserAgent along with Mojo::DOM already have the support I need without me tracking down the right combinations of modules to use, and it handles the UTF-8 bits without me doing anything special:

use v5.10;
use open qw( :std :utf8 ); 
use Mojo::UserAgent; 

my $ua = Mojo::UserAgent->new; 
my $URL = "http://www.boursier.com/syndication/rss/news/FR0004031839/FR"; 
my $response = $ua->get($URL)->res; 

my @links = $response
    ->dom( 'item > title' )
    ->map( sub { $_->text } )
    ->each;

$" = "\n";
print "@links\n";

I have another example at Painless RSS processing with Mojo.

brian d foy
  • I ran it, but it shows broken characters in place of the accents. I do have a UTF-8 terminal, locale and font. This is what the output looks like: http://imgur.com/KyVPJ –  Apr 24 '12 at 11:10
  • What happens when you run it with `perl -C`? – brian d foy Apr 24 '12 at 13:26
  • All good, i.e. `perl -C q.pl` shows the correct characters. Further tests showed that `-CO`/`-C2` is enough. –  Apr 24 '12 at 13:28
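
For readers wondering what `-CO` changes: it tells perl to treat STDOUT as UTF-8, which is roughly what `binmode(STDOUT, ':encoding(UTF-8)')` does inside a script. The sketch below is only an illustration of that difference, not a diagnosis of the output in the screenshot. It writes the same character to an in-memory filehandle with and without an encoding layer and shows the bytes that come out. The helper name `bytes_written` is hypothetical.

```perl
use strict;
use warnings;
use utf8;    # the source below contains a literal non-ASCII character

# Hypothetical helper: prints $text to an in-memory filehandle opened with
# the given mode and returns the raw bytes that were written.
sub bytes_written {
    my ($mode, $text) = @_;
    my $buffer = '';
    open my $fh, $mode, \$buffer or die "Cannot open in-memory handle: $!";
    print {$fh} $text;
    close $fh;
    return $buffer;
}

# Format a byte string as space-separated hex values.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

my $char = "é";    # one character, code point U+00E9

# Without a layer, perl writes the single Latin-1 byte for this character.
print "no layer:    ", hex_dump(bytes_written('>', $char)), "\n";

# With a UTF-8 layer, perl writes the two-byte UTF-8 sequence instead.
print "UTF-8 layer: ", hex_dump(bytes_written('>:encoding(UTF-8)', $char)), "\n";
```

A UTF-8 terminal cannot render the lone Latin-1 byte, which is one way accented characters can end up as placeholder glyphs when the output handle has no encoding layer.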
3

The RSSLite documentation explicitly states:

Remove characters other than 0-9~!@#$%^&*()-+=a-zA-Z[];',.:"<>?\s

Therefore, the module is hopelessly broken for anything beyond ASCII. Try again with XML::Feed.
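
To see why the accents vanish, here is a minimal sketch that applies the same rule as the documentation line quoted above: keep only the listed characters and delete everything else. The helper `strip_like_rsslite` is a hypothetical reconstruction written for illustration, not XML::RSSLite's actual source.

```perl
use strict;
use warnings;
use utf8;    # the sample string below contains literal accented characters

# Hypothetical helper: deletes every character that is not in the set
# listed by the documentation (0-9~!@#$%^&*()-+=a-zA-Z[];',.:"<>? and
# whitespace). A reconstruction for illustration only.
sub strip_like_rsslite {
    my ($text) = @_;
    (my $kept = $text) =~ s/[^0-9~!\@#\$%^&*()\-+=a-zA-Z\[\];',.:"<>?\s]//g;
    return $kept;
}

binmode STDOUT, ':encoding(UTF-8)';

my $title = "Résultats en hausse à Paris";
print "before: $title\n";
print "after:  ", strip_like_rsslite($title), "\n";
```

Because é and à are not in the allowed set, they are removed no matter how correctly the content was decoded beforehand, which matches the behavior described in the question.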

daxim