0

I need to run a Perl regex repeatedly over a single HTML file and store each match in an array.

The HTML file looks like this:

<a href="/jobs_qa">Job QA</a>

Title:
Commercial Bank 
<p></p>
City:
TX   
State:
TX  
Country:

<p></p>
Full Description:
<p></p>
<p> Citi North America Consumer Banking group serves customers through Retail Banking, Credit Cards, Personal Banking and Wealth Management, Small Business Banking and Commercial Banking.     </p>

<p>Commercial Bank Head - Houston-11030087</p>

<p>Description </p>

<p>POSITION SUMMARY</p>

<p>Lead the sales, relationship, and credit management for commercial banking customers in a given marketplace.  Build and motivate talented relationship teams to effectively penetrate the market and gain market share.  Current business segment includes those clients with revenues from $20 to $500+ million annually.    Clients in this segment typically require more complex product offerings and customized credit decisions made in the field.</p>

<p> </p>

<p> </p>



<p>Qualifications </p>

<p>EXPERIENCE
<br />-MBA or equivalent experience
<br />-Minimum 10 years business and/or commercial banking with increasing levels of responsibility

<p> </p>


<a href="http://www.mysite.com/jobs/">http://www.e.com/jobs/commercial-bank-head-houston-citi-houston-tx</a>
<hr>
Title:
Sr Business Relationship 
<p></p>
City:
CO   
State:
CO  
Country:

<p></p>
Full Description:
<p></p>
<p>Effectively acquires, manages and grows profitable account relationships with an extensive percentage of moderately complex and medium sized business customers that have annual gross sales of generally more than $2MM and less than $20MM. Ensures the overall success & growth of an assigned portfolio by deepening relationships of existing customers and through the acquisition of new customers. 
<p></p>
<a href="http://www.mysite.com/jobs/">http://www.e.com/jobs/sr-business-relationship-mgr-wells-fargo-avon-co</a>
<hr>
Title:
Implementation Associate
<p></p>
City:
WI   
State:
WI  
Country:

<p></p>
Full Description:
<p></p>
<p>Works with project managers and project teams to determine implementation strategy, methods and plans for initiatives that typically impact single systems, workflows or products with low risk and complexity or where work is completed under guidance. Coordinates development of business requirements. Develops standard communication and training plans and materials. Implements communications and training plans. Tracks implementation tasks and budgets, identifies and reports issues or escalates as needed and reports project status. Documents or updates best practices, workflows or procedures. May also be responsible to miscellaneous business administrative initiatives.2+ years experience in one or more of the following: administrative support; project management; implementation; or participation in project teams as part of on-going responsibilities in a postion supporting the line of business.Relevant project management and/or implementation experience- Proven organizational, motivational, time management, prioritization, detail orientation
<br /> and multi-tasking skills. 
<br />- Proven oral and written communication skills to support each line of business. 
<br />- Experience with PC applications - Word, Excel, Access, Power Point and Visio.</p>
<p></p>
<a href="http://www.mysite.com/jobs/">http://www.e.com/jobs/implementation-associate-wells-fargo-milwaukee-wi</a>
<hr>
Title:
......... ... ..... ........ 

...............

And so on. I want to capture everything from one "Title:" up to the next, i.e. $array[0] = "Title: Commercial Bank <p></p>City:TX ........."
and $array[1] = "Title: Sr Business Relationship <p></p> ", and so on.

I would have approximately 300 such values.

I also need the HTML tags inside each value, because I need to validate that the tags are used correctly. I do not know in advance what the content between the tags will be.

What I have tried:

my $i=0;
my @array;
while ($html =~ m/.*(Title:.*?)Title:/ig)
{
    $array[$i]=$1;
    $i++;
}

foreach (@array)
{
    print "$_";
}

But absolutely nothing gets picked up. Please advise.
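
A likely reason the loop above matches nothing: without the /s modifier, `.` does not match a newline, so `Title:.*?Title:` can never span the lines between two titles. The leading greedy `.*` and the consumed trailing `Title:` would also cause every other record to be skipped. The following is a minimal sketch of one possible correction, assuming that every record starts with `Title:` at the beginning of a line; the shortened sample in `$html` is made up for illustration. It only splits the text into chunks with their tags intact and does not validate anything.

```perl
use strict;
use warnings;

# Shortened, made-up sample in the same shape as the file in the question.
my $html = <<'HTML';
<a href="/jobs_qa">Job QA</a>

Title:
Commercial Bank
<p></p>
City:
TX
<hr>
Title:
Sr Business Relationship
<p></p>
<hr>
Title:
Implementation Associate
<p></p>
<hr>
HTML

my @array;

# /s lets "." match newlines; /m lets "^" match at the start of each line.
# The lookahead (?=...) stops before the next "Title:" without consuming it,
# so the next iteration can start its match there. \z handles the last record.
while ($html =~ m/^(Title:.*?)(?=^Title:|\z)/gms) {
    push @array, $1;
}

print scalar(@array), " records found\n";
print "-----\n$_" for @array;
```

Each element of `@array` keeps its tags exactly as they appear in the file, so it can be passed on to whatever performs the actual validation.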

Amey
  • 3
    Regex is not appropriate for parsing HTML. You're attempting to use a hacksaw to work on a screw. You _might_ get what you want, eventually, but in the process you'll lose a finger or two. – g.d.d.c Oct 21 '11 at 16:31
  • 1
    Why you shouldn't use regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Robert P Oct 21 '11 at 16:57
  • But I need the tags, because that is what I am trying to validate: whether the tags are used correctly or not. – Amey Oct 21 '11 at 17:02
  • @perlnewbie What makes you think you lose the tags if you use an HTML parser? It is the validation that's tricky. Use a validator, don't write one from scratch. – Sinan Ünür Oct 21 '11 at 17:08
  • What would this look like if it were correctly generated? – Sinan Ünür Oct 21 '11 at 17:23
  • OK, it seems like many against one in terms of the technique to use :) Sinan, I will mark your response as the answer. Thanks, everyone – Amey Oct 21 '11 at 17:44

1 Answer

5

Don't use regular expressions to parse HTML. Use an HTML parser. There are many on CPAN. One of my favorites is HTML::TokeParser::Simple.

HTML::Tidy and the W3 validator can help you check HTML documents.
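
Using a parser does not mean losing the tags. As a rough sketch, assuming HTML::TokeParser::Simple is installed from CPAN and that records are separated by `<hr>` as in the question, the token method `as_is` returns each piece of markup exactly as it appeared in the input. The sample string and the `@records` name are made up for illustration, and this only collects the chunks; checking them is still a job for a validator.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Made-up sample; the second record has an unclosed <p> on purpose.
my $html = <<'HTML';
Title:
Commercial Bank
<p></p>
City:
TX
<a href="http://www.mysite.com/jobs/">first</a>
<hr>
Title:
Sr Business Relationship
<p>Unclosed paragraph
<hr>
HTML

my $parser  = HTML::TokeParser::Simple->new(string => $html);
my @records = ('');

while (my $token = $parser->get_token) {
    # Treat each <hr> as a record separator and start a new chunk.
    if ($token->is_start_tag('hr')) {
        push @records, '';
        next;
    }
    # as_is gives the original text of the token, tags included.
    $records[-1] .= $token->as_is;
}

# Drop chunks that contain only whitespace (e.g. after the final <hr>).
@records = grep { /\S/ } @records;

print scalar(@records), " records found\n";
```

Each element of `@records` still contains its markup, malformed or not, and could then be handed to HTML::Tidy or the W3 validator.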

Sinan Ünür
  • But I actually need the HTML tags in the array elements, so I will have to use regex and not the CPAN HTML parsers. I am basically validating if the HTML is created correctly – Amey Oct 21 '11 at 16:38
  • So Sinam, you are right: there are many malformed tags, and that is what I want to validate for each job description. – Amey Oct 21 '11 at 16:39
  • 1
    @perlnewbie What is it with the spelling of my name that even with the tooltip SO provides people insist on spelling it as Sina **m** ??? – Sinan Ünür Oct 21 '11 at 16:46
  • How is the HTML being generated? Have you considered installing and using the W3 validator? – Sinan Ünür Oct 21 '11 at 16:48
  • My apologies :) Sinan, blame it on my watery eyes... they blur the stick of the 'a' into the 'n' and make it look like an 'm' – Amey Oct 21 '11 at 16:50
  • The HTML is generated using Ruby, with some tool that I do not know. My task is pretty much black-box style: I validate the generated HTML file. Is the W3 validator something that can help me validate it? I do not know of it, so I'll just google it – Amey Oct 21 '11 at 16:52
  • 1
    @perlnewbie Have you not noticed the links in my answer? Also, why would any computer program generate HTML like this? I have only seen this kind of mess when our faculty pages were all manually typed. – Sinan Ünür Oct 21 '11 at 17:07
  • @perlnewbie - to paraphrase Sinan's last comment: think outside the box. Your task is easier to accomplish by fixing the Ruby HTML printer (and I mean SIGNIFICANTLY easier) than by trying to come up with a not-too-broken way, in Perl or not, of fixing it up post-creation. – DVK Oct 21 '11 at 17:37