I usually convert a hash into a pulldown (a select element). This time I want to do the opposite. Does anyone know how to do that using a regular expression, or any other way? Note that the pulldown contains both optgroup and option elements. I only want the options: the id of each option should become the hash key, and its text should become the hash value.

For example, given a pulldown like this:

<select>
<optgroup label=fruits>
<option id=1>Apple</option>
<option id=2>Orange</option>
<option id=3>Pineapple</option>
<optgroup label=stuff>
<option id=4>Chair</option>
<option id=5>Board</option>
</select>

I want it to be

1 => "Apple", 2 => "Orange", 3 => "Pineapple", 4 => "Chair", 5 => "Board"
Luci
  • Are you trying to ask how to parse HTML with a regular expression? http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Ben Jackson Dec 17 '12 at 14:50
  • If you are trying to load an XML document into a Perl hash, use the XML::Simple module from CPAN – Miguel Prz Dec 17 '12 at 14:53
  • Here are examples of how to parse HTML using Perl: http://htmlparsing.com/perl.html – Andy Lester Dec 17 '12 at 15:01

2 Answers

You don't explain the source of your select element, but I assume it is part of a complete HTML document?

This is best done using HTML::TreeBuilder, which will build a tree structure of your HTML page and allow you to navigate through it.

All this program does is find all option descendants of the first select element in the page and build a hash, using the id attribute of each option as the key and its text content as the value.

I have used Data::Dump only to demonstrate the contents of the final hash.

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_content(<<'END');

<select>
<optgroup label=fruits>
<option id=1>Apple</option>
<option id=2>Orange</option>
<option id=3>Pineapple</option>
<optgroup label=stuff>
<option id=4>Chair</option>
<option id=5>Board</option>
</select>

END

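# Find the first <select> element in the document
my $select = $tree->look_down(_tag => 'select');

# Map each <option> descendant to an (id => text) pair
my %data = map { $_->id => $_->as_trimmed_text } $select->look_down(_tag => 'option');

# Dump the resulting hash for inspection
use Data::Dump;
dd \%data;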

output

{ 1 => "Apple", 2 => "Orange", 3 => "Pineapple", 4 => "Chair", 5 => "Board" }
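
If the same conversion is needed in more than one place, the steps above could be wrapped in a small function. The following is only a sketch: the name select_to_hash is hypothetical, and it assumes that options without an id should simply be skipped and that a page with no select element should yield an empty hash.

```perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Hypothetical helper: returns a hash ref of option id => option text
# for the first <select> in $html, or an empty hash ref if there is none.
sub select_to_hash {
    my ($html) = @_;

    my $tree   = HTML::TreeBuilder->new_from_content($html);
    my $select = $tree->look_down(_tag => 'select');

    my %data;
    if ($select) {
        for my $option ($select->look_down(_tag => 'option')) {
            my $id = $option->id;
            next unless defined $id;    # assumption: skip options with no id
            $data{$id} = $option->as_trimmed_text;
        }
    }

    $tree->delete;    # release the parse tree once we are done with it
    return \%data;
}
```

Calling select_to_hash($html) and dumping the result with dd should then show the same kind of structure as above.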
Borodin

I suggest you heed Ben Jackson's warning about parsing HTML with regexes.

However, sometimes you need a quick and dirty solution. You could do something like this:

use warnings;
use strict;

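my %options;
while (<DATA>)
{
    # Match an option tag at the start of the line; capture the numeric id and the text
    if (/^<option\s+id=(\d+)>([\w\s]+)/)
    {
        $options{$1} = $2;
    }
}

# Print the pairs in numeric id order (hash keys are otherwise unordered)
print "$_: $options{$_}\n" for sort { $a <=> $b } keys %options;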

__DATA__
<select>
<optgroup label=fruits>
<option id=1>Apple</option>
<option id=2>Orange</option>
<option id=3>Pineapple</option>
<optgroup label=stuff>
<option id=4>Chair</option>
<option id=5>Board</option>
</select>

This makes various assumptions, such as: the option tag never has other attributes in it, it is always at the beginning of a line, the id is an unquoted number, the option text contains only word characters and whitespace, option ids are unique across the entire file, etc.

If your input is quite predictable, so that you can make assumptions like that, this should work fine. But if you need a "generic" solution, don't use a regex.
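
If a couple of those assumptions do not hold, the pattern can be loosened a little. The sketch below is still quick and dirty and is not a general HTML parser: the name options_from_string is hypothetical, and it assumes that ids are numeric, that each option tag contains no ">" inside an attribute value, and that the option text contains no nested tags. It scans a whole string rather than reading line by line, and tolerates quoted ids, extra attributes, and leading whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper: scan an HTML string for option tags with a numeric id
# and return a hash of id => text.
sub options_from_string {
    my ($html) = @_;
    my %options;

    while ($html =~ /<option\b[^>]*?\sid\s*=\s*["']?(\d+)["']?[^>]*>\s*([^<]*?)\s*(?=<|\z)/gi) {
        # $1 is the id; $2 is the text up to the next tag, with surrounding whitespace trimmed
        $options{$1} = $2;
    }

    return %options;
}

my $html = <<'END';
<select>
<optgroup label=fruits>
<option id=1>Apple</option>
  <option class="x" id="2" selected>Orange
<option id=3>Pineapple</option>
</select>
END

my %options = options_from_string($html);
print "$_: $options{$_}\n" for sort { $a <=> $b } keys %options;
```

Anything less predictable than this is better handled by a real parser.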

dan1111