
I need to extract data from an HTML document and compose an XML document containing only the interesting information. I'm doing this by transforming the HTML document into an XML document, step by step. I have the 5 outermost XML tags on one line each, and now I'm trying to structure what's inside them.

I have a line that's structured this way:

   <myTag> 
      blablabla <a href="link/I/want" *some css* > title I want </a> some other stuff <a href="link that/I/don't/want" *some css*> text I don't want </a> blablabla 
   </myTag>

What I want is:

    <myTag>
    <link>link/I/want</link>
    <title> title I want </title>
    </myTag>

The regex I have is:

    /a href="(.*)"(.*)>(.*)<\/a>/ 

hoping to get $1 = URL, $2 = whatever, $3 = title.

This isn't working, because it captures this instead:

    <myTag>
    <link>link/I/want *some css* > title I want </a> some other stuff <a href="link that/I/don't/want" *some css*</link>
    <title>text I don't want</title>
    </myTag>

How do I extract the content of the FIRST anchor tag on the line?

Thanks!

Myna
  • [Regexes are not the appropriate tool for parsing HTML.](http://stackoverflow.com/a/1732454/1633117) – Martin Ender Oct 03 '12 at 21:17
  • Please don't use regular expressions for parsing HTML. Here's how to do it right in Perl: http://htmlparsing.com/perl.html – Andy Lester Oct 03 '12 at 21:20
  • Even for just one line of HTML? :( OK, I will look into these. But I feel like a regex is not too bad here, because I'm not looking for a piece of info in the whole HTML file, only in one short line. Let me know if this is really bad to do in my situation. Thanks for the dedication, I really appreciate it. :D – Myna Oct 04 '12 at 07:44
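For comparison, the parser route suggested in the comments can be sketched roughly as follows. This is only an illustration in Python using the standard-library `html.parser` module, not the Perl tooling from the linked page; the class name `FirstAnchor`, the helper `first_anchor_parsed`, and the `class="x"` attributes standing in for `*some css*` are all made up for the example.

```python
from html.parser import HTMLParser


class FirstAnchor(HTMLParser):
    """Collect the href and text of the first <a> element only (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.href = None
        self.text_parts = []
        self._inside = False  # currently inside the first <a>
        self._done = False    # first <a> already closed

    def handle_starttag(self, tag, attrs):
        if tag == "a" and not self._inside and not self._done:
            self.href = dict(attrs).get("href")
            self._inside = True

    def handle_data(self, data):
        if self._inside:
            self.text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self._inside = False
            self._done = True


def first_anchor_parsed(fragment):
    """Return (href, text) for the first anchor in an HTML fragment."""
    parser = FirstAnchor()
    parser.feed(fragment)
    parser.close()
    return parser.href, "".join(parser.text_parts).strip()


# Sample fragment modelled on the question (attribute values are assumptions).
FRAGMENT = (
    '<myTag> blablabla <a href="link/I/want" class="x"> title I want </a> '
    'some other stuff <a href="link that/I/don\'t/want" class="y"> '
    "text I don't want </a> blablabla </myTag>"
)

if __name__ == "__main__":
    print(first_anchor_parsed(FRAGMENT))
```

Unlike a regex, this does not depend on the attribute order inside the tag or on `href` coming directly after `a`.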

1 Answer


Just use non-greedy expressions:

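    /a href="(.*?)"(.*?)>(.*?)<\/a>/

Note the ? after each *. It makes the quantifier match as little as possible, so $1 stops at the first closing quote and $3 stops at the first </a>, instead of running on to the last one on the line.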
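A quick way to see the difference is to run both patterns on the sample line. The sketch below uses Python's `re` module, which is an assumption since the question does not name a language; the lazy-quantifier syntax is the same in Perl-style engines. The names `LINE`, `first_anchor` and `to_xml`, and the `class` attributes standing in for `*some css*`, are made up for illustration.

```python
import re

# Sample line modelled on the question; the class attributes stand in for
# the "*some css*" placeholders and are an assumption.
LINE = (
    'blablabla <a href="link/I/want" class="x"> title I want </a> some other stuff '
    '<a href="link that/I/don\'t/want" class="y"> text I don\'t want </a> blablabla'
)

GREEDY = re.compile(r'a href="(.*)"(.*)>(.*)</a>')    # pattern from the question
LAZY = re.compile(r'a href="(.*?)"(.*?)>(.*?)</a>')   # pattern from the answer


def first_anchor(line, pattern=LAZY):
    """Return (url, title) from the first match of pattern, or None if no match."""
    m = pattern.search(line)
    if m is None:
        return None
    return m.group(1), m.group(3).strip()


def to_xml(url, title):
    """Build the <myTag> block the question asks for (no XML escaping is done)."""
    return f"<myTag>\n<link>{url}</link>\n<title>{title}</title>\n</myTag>"


if __name__ == "__main__":
    print(first_anchor(LINE, GREEDY))  # url group runs on into the second anchor
    print(first_anchor(LINE, LAZY))    # url and title of the first anchor
    print(to_xml(*first_anchor(LINE)))
```

With the greedy pattern the title comes from the second anchor and the URL group swallows everything in between; with the lazy pattern both come from the first anchor. A stricter variant would be `[^"]*` for the URL group, which cannot cross a closing quote at all.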

Igor Chubin